Abstract

Submodular functions are discrete functions that model laws of diminishing returns and enjoy numer- 
ous algorithmic applications. They have been used in many areas, including combinatorial optimization, 
machine learning, and economics. In this work we study submodular functions from a learning theoretic 
angle. We provide algorithms for learning submodular functions, as well as lower bounds on their learn- 
ability. In doing so, we uncover several novel structural results revealing ways in which submodular 
functions can be both surprisingly structured and surprisingly unstructured. We provide several con- 
crete implications of our work in other domains including algorithmic game theory and combinatorial 
optimization. 

At a technical level, this research combines ideas from many areas, including learning theory (dis- 
tributional learning and PAC-style analyses), combinatorics and optimization (matroids and submodular 
functions), and pseudorandomness (lossless expander graphs). 

1 Introduction 

Submodular functions are a discrete analog of convex functions that enjoy numerous applications and have 
structural properties that can be exploited algorithmically. They arise naturally in the study of graphs, 
matroids, covering problems, facility location problems, etc., and they have been extensively studied in
operations research and combinatorial optimization for many years [22]. More recently, submodular functions
have become key concepts in other areas including machine learning, algorithmic game theory, and social
sciences. For example, submodular functions have been used to model bidders' valuation functions in
combinatorial auctions [40, 63, 20, 6, 85] and for solving several machine learning problems, including feature
selection problems in graphical models [57] and various clustering problems [71].

In this work we use a learning theory perspective to uncover new structural properties of submodular 
functions. In addition to providing algorithms and lower bounds for learning submodular functions, we 
discuss numerous implications of our work in algorithmic game theory, economics, matroid theory and 
combinatorial optimization. 

One of our foremost contributions is to provide the first known results about learnability of submodular 
functions in a distributional (i.e., PAC-style) learning setting. Informally, such a setting has a fixed but 
unknown submodular function $f^*$ and a fixed but unknown distribution over the domain of $f^*$. The goal is
to design an efficient algorithm which provides a good approximation of $f^*$ with respect to that distribution,
given only a small number of samples from the distribution. 

Formally, let [n] = {1, . . . , n} denote a ground set of items and let $2^{[n]}$ be the power set of [n]. A function
$f : 2^{[n]} \to \mathbb{R}$ is submodular if it satisfies

$$f(T \cup \{x\}) - f(T) \le f(S \cup \{x\}) - f(S) \qquad \forall S \subseteq T \subseteq [n],\ x \in [n].$$
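The inequality above can be checked mechanically on small ground sets. The following is a minimal sketch, not part of the paper: the names `subsets`, `is_submodular`, `coverage`, `squared_size` and the toy data `COVERS` are illustrative assumptions, and the check is exponential in n, so it is only meant for toy instances.

```python
from itertools import combinations

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(f, n, tol=1e-12):
    """Check f(T+x) - f(T) <= f(S+x) - f(S) for all S <= T <= [n] and x in [n]."""
    ground = range(1, n + 1)
    for T in subsets(ground):
        for S in subsets(T):          # every S contained in T
            for x in ground:
                gain_T = f(T | {x}) - f(T)
                gain_S = f(S | {x}) - f(S)
                if gain_T > gain_S + tol:
                    return False
    return True

# Hypothetical coverage instance: item i covers the points in COVERS[i].
COVERS = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}, 4: {"a", "d"}}

def coverage(S):
    """Number of points covered by the items in S (diminishing returns)."""
    covered = set()
    for i in S:
        covered |= COVERS[i]
    return len(covered)

def squared_size(S):
    """|S|^2 has increasing marginal gains, so it is not submodular."""
    return len(S) ** 2

print(is_submodular(coverage, 4), is_submodular(squared_size, 4))
```

The coverage function passes the check, while $|S|^2$ fails it because its marginal gains grow with the size of the set.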

* A preliminary version of this paper appeared in the 43rd ACM Symposium on Theory of Computing under the title "Learning 
Submodular Functions". 
The goal is to output a function $f$ that, with probability $1 - \delta$ over the samples, is a good approximation
of $f^*$ on most of the sets coming from the distribution. Here "most" means a $1 - \epsilon$ fraction and "good
approximation" means that $f(S) \le f^*(S) \le \alpha \cdot f(S)$ for some approximation factor $\alpha$. We prove nearly
matching $\alpha = O(n^{1/2})$ upper and $\alpha = \tilde{\Omega}(n^{1/3})$ lower bounds on the approximation factor achievable when
the algorithm receives only $\mathrm{poly}(n, 1/\epsilon, 1/\delta)$ examples from an arbitrary (fixed but unknown) distribution.
We additionally provide a learning algorithm with constant approximation factor for the case that the un- 
derlying distribution is a product distribution. This is based on a new result proving strong concentration 
properties of submodular functions. 

To prove the $\tilde{\Omega}(n^{1/3})$ lower bound for learning under arbitrary distributions, we construct a new family
of matroids whose rank functions are fiendishly unstructured. Since matroid rank functions are submod- 
ular, this shows unexpected extremal properties of submodular functions and gives new insights into their 
complexity. This construction also provides a general tool for proving lower bounds in several areas where 
submodular functions arise. We discuss and derive such implications in: 

• Algorithmic Game Theory and Economics: An important consequence of our construction is that 
matroid rank functions do not have a "sketch", i.e., a concise, approximate representation. As matroid
rank functions can be shown to satisfy the gross substitutes property [70], our work implies that
gross substitutes functions also do not have a concise, approximate representation. This provides a
surprising answer to an open question in economics [9], [10, Section 6.2.1].

• Combinatorial Optimization: Many optimization problems involving submodular functions, such as 
submodular function minimization, are very well behaved and their optimal solutions have a rich 
structure. In contrast, we show that, for several other submodular optimization problems which have 
been considered recently in the literature, including submodular s-t min cut and submodular vertex 
cover, their optimal solutions are very unstructured, in the sense that the optimal solutions do not 
have a succinct representation, or even a succinct, approximate representation. 

Although our new family of matroids proves that matroid rank functions (and more generally submodular 
functions) are surprisingly unstructured, our concentration result for submodular functions shows that, in a 
different sense, matroid rank functions (and other sufficiently "smooth" submodular functions) are surpris- 
ingly structured. 

Submodularity has been an increasingly useful tool in machine learning in recent years. For example, it 
has been used for feature selection problems in graphical models [57] and various clustering problems [71].
In fact, submodularity has been the topic of several tutorials and workshops at recent major conferences in 
machine learning [Q~l[58j|59l|2]. Nevertheless, our work is the first to use a learning theory perspective to 
derive new structural results for submodular functions and related structures (including matroids), thereby 
yielding implications in many other areas. Our work also potentially has useful applications — our learning 
algorithms can be employed in many areas where submodular functions arise (e.g., medical decision making
and economics). We discuss such applications in Section 1.2. Furthermore, our work defines a new learning
model for approximate distributional learning that could be useful for analyzing learnability of other
interesting classes of real-valued functions. In fact, this model has already been used to analyze the learnability
of several classes of set functions widely used in economics; see Section 1.1.2 and Section 8.1.

1.1 Our Results and Techniques

The central topic of this paper is proving new structural results for submodular functions, motivated by 
learnability considerations. In the following we provide a more detailed description of our results. For ease 
of exposition, we start by describing our new structural results, then present our learning model and our 
learnability results within this model, and finally we describe implications of our results in various areas. 
1.1.1 New Structural Results



A new matroid construction The first result in this paper is the construction of a family of submodular 
functions with interesting technical properties. These functions are the key ingredient in our lower bounds 
for learning and property testing of submodular functions, inapproximability results for submodular op- 
timization problems, and the non-existence of succinct, approximate representations for gross substitutes 
functions. 

Designing submodular functions directly is difficult because there is very little tangible structure to work 
with. It turns out to be more convenient to work with matroids,¹ because every matroid has an associated
submodular function (its rank function) and because matroids are a very rich class of combinatorial objects 
with numerous well-understood properties. 

Our goal is to find a collection of subsets of [n] and two values $r_{\mathrm{high}}$ and $r_{\mathrm{low}}$ such that, for any labeling
of these subsets as either HIGH or LOW, we can construct a matroid for which each set labeled HIGH has
rank value $r_{\mathrm{high}}$ and each set labeled LOW has rank value $r_{\mathrm{low}}$. We would like both the size of the collection
and the ratio $r_{\mathrm{high}}/r_{\mathrm{low}}$ to be as large as possible.

Unfortunately, existing matroid constructions can only achieve this goal with very weak parameters; for
further discussion of existing matroids, see Section 1.3. Our new matroid construction, which involves
numerous technical steps, achieves this goal with the collection of size super-polynomial in n and the ratio
$r_{\mathrm{high}}/r_{\mathrm{low}} = \tilde{\Omega}(n^{1/3})$. This shows that matroid rank functions can be fiendishly unstructured: in our
construction, knowing the value of the rank function on all-but-one of the sets in the collection does not
determine the rank value on the remaining set, even to within a multiplicative factor $\tilde{\Omega}(n^{1/3})$.

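More formally, let the collection of sets be $A_1, \ldots, A_k \subseteq [n]$ where each $|A_i| = r_{\mathrm{high}}$. For every set of
indices $B \subseteq \{1, \ldots, k\}$ there is a matroid $M_B$ whose associated rank function $r_B : 2^{[n]} \to \mathbb{R}$ has the form

$$r_B(S) = \max \Big\{ |I \cap S| \;:\; \Big| I \cap \bigcup_{j \in J} A_j \Big| \le r_{\mathrm{low}} \cdot |J| - \sum_{j \in J} |A_j| + \Big| \bigcup_{j \in J} A_j \Big| \quad \forall J \subseteq B,\ |J| < \tau \Big\}. \tag{1.1}$$

We show that, if the sets $A_i$ satisfy a strong expansion property, in the sense that they are nearly disjoint,
and the parameters $r_{\mathrm{high}}$, $r_{\mathrm{low}}$, $\tau$ are carefully chosen, then this function satisfies $r_B(A_i) = r_{\mathrm{low}}$ whenever
$i \in B$ and $r_B(A_i) = r_{\mathrm{high}}$ whenever $i \notin B$.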

Concentration of Submodular Functions A major theme in probability theory is proving concentration
bounds for a function $f : 2^{[n]} \to \mathbb{R}_{\ge 0}$ under product distributions.² For example, when $f$ is linear, the
Chernoff-Hoeffding bound is applicable. For arbitrary $f$, the McDiarmid inequality is applicable. The
quality of these bounds also depends on the "smoothness" of $f$, which is quantified using the Lipschitz
constant $L := \max_{S, x} |f(S \cup \{x\}) - f(S)|$.

We show that McDiarmid's tail bound can be strengthened under the additional assumption that the func- 
tion is monotone and submodular. For a 1-Lipschitz function (i.e., L = 1), McDiarmid's inequality gives 
concentration comparable to that of a Gaussian random variable with standard deviation $\sqrt{n}$. For example,
the probability that the value of $f$ is $\sqrt{n}$ less than its expectation is bounded above by a constant. Such
a bound is quite weak when the expectation of $f$ is significantly less than $\sqrt{n}$, because it says that the
probability of $f$ being negative is at most a constant, even though that probability is actually zero.
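To make the gap concrete, the following sketch computes the exact mean and standard deviation of a small 1-Lipschitz, monotone, submodular function under a product distribution and compares the spread with the $\sqrt{n}$ scale discussed above. It is an illustration only, not the paper's argument: the function $f(S) = \min(|S|, 2)$ (the rank function of a uniform matroid), the choice n = 14, and all helper names are assumptions made for the example.

```python
from itertools import product
from math import sqrt

def exact_mean_and_std(f, probs):
    """Exact mean and standard deviation of f(S) when item i is placed in S
    independently with probability probs[i] (a product distribution)."""
    mean = 0.0
    second_moment = 0.0
    for bits in product([0, 1], repeat=len(probs)):
        weight = 1.0
        for b, p in zip(bits, probs):
            weight *= p if b else (1.0 - p)
        S = frozenset(i for i, b in enumerate(bits) if b)
        value = f(S)
        mean += weight * value
        second_moment += weight * value * value
    variance = max(second_moment - mean * mean, 0.0)
    return mean, sqrt(variance)

def capped_cardinality(r):
    """f(S) = min(|S|, r): monotone, submodular and 1-Lipschitz."""
    return lambda S: min(len(S), r)

n = 14                                  # small enough to enumerate all 2^n sets
probs = [0.5] * n
mean_f, std_f = exact_mean_and_std(capped_cardinality(2), probs)
print(f"mean = {mean_f:.4f}, std = {std_f:.4f}, sqrt(n) = {sqrt(n):.4f}")
```

In this toy instance the expectation is just below 2 and the standard deviation is a small fraction of $\sqrt{n}$, so a deviation bound stated at the $\sqrt{n}$ scale says very little about the function's actual behavior.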

Using Talagrand's inequality, we show that 1-Lipschitz, monotone, submodular functions are extremely
tightly concentrated around their expected value. The quality of concentration that we show is similar to
Chernoff-Hoeffding bounds; importantly, it depends only on the expected value of the function, and not
on the dimension n.

1 For the reader unfamiliar with matroids, a brief introduction to them is given in Section 2.2. For the present discussion, the
only fact that we need about matroids is that the rank function of a matroid on [n] is a submodular function on $2^{[n]}$.

2 A random set $S \subseteq [n]$ is said to have a product distribution if the events $i \in S$ and $j \in S$ are independent for every $i \ne j$.

Approximate characterization of matroids Our new matroid construction described above can be viewed 
at a high level as saying that matroids can be surprisingly unstructured. One can pick numerous large regions 
of the matroid (namely, the sets $A_i$) and arbitrarily decide whether each region should have large rank or
small rank. Thus the matroid's structure is very unconstrained. 

Our next result shows that, in a different sense, a matroid's structure is actually very constrained. If one 
fixes any integer k and looks at the rank values amongst all sets of size k, then those values are extremely 
tightly concentrated around their average — almost all sets of size k have nearly the same rank value. 
Moreover, these averages are concave as a function of k. That is, there exists a concave function
$h : [0, n] \to \mathbb{R}_{\ge 0}$ such that almost all sets $S$ have rank approximately $h(|S|)$.

This provides an interesting converse to the well-known fact that the function $f : 2^{[n]} \to \mathbb{R}$ defined by
$f(S) = h(|S|)$ is a submodular function whenever $h : \mathbb{R} \to \mathbb{R}$ is concave. Our proof uses our aforementioned
result on concentration for submodular functions under product distributions, and the multilinear
extension [13] of submodular functions, which has been of great value in recent work.
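The well-known fact mentioned above (concave $h$ gives a submodular $f(S) = h(|S|)$) is easy to test numerically on a small ground set. The sketch below is illustrative only; the helper names and the two sample choices of $h$ are assumptions, not taken from the paper.

```python
from itertools import combinations
from math import sqrt

def is_discrete_concave(h, n, tol=1e-12):
    """True if the increments h(k+1) - h(k) are non-increasing for k = 0..n-1."""
    steps = [h(k + 1) - h(k) for k in range(n)]
    return all(later <= earlier + tol for earlier, later in zip(steps, steps[1:]))

def is_submodular(f, n, tol=1e-12):
    """Brute-force check of the diminishing-returns inequality for all S <= T."""
    ground = list(range(1, n + 1))
    family = [frozenset(c) for r in range(n + 1) for c in combinations(ground, r)]
    for T in family:
        for S in family:
            if not S <= T:
                continue
            for x in ground:
                if f(T | {x}) - f(T) > f(S | {x}) - f(S) + tol:
                    return False
    return True

def cardinality_function(h):
    """Turn h into the set function f(S) = h(|S|)."""
    return lambda S: h(len(S))

n = 5
for name, h in [("sqrt", sqrt), ("square", lambda k: k * k)]:
    print(name, is_discrete_concave(h, n), is_submodular(cardinality_function(h), n))
```

For functions that depend only on $|S|$, the two checks agree: the square root is concave and yields a submodular function, while $k^2$ is convex and does not.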

1.1.2 Learning Submodular Functions 

The Learning Model To study the learnability of submodular functions, we extend Valiant's classic PAC
model [82], which captures settings where the learning goal is to predict the future based on past
observations. The abbreviation PAC stands for "Probably Approximately Correct". The PAC model however is
primarily designed for learning Boolean-valued functions, such as linear threshold functions, decision trees,
and low-depth circuits [82, 54]. For real-valued functions, it is more meaningful to change the model by
ignoring small-magnitude errors in the predicted values. Our results on learning submodular functions are 
presented in this new model, which we call the PMAC model; this abbreviation stands for "Probably Mostly 
Approximately Correct". 

In this model, a learning algorithm is given a collection $\mathcal{S} = \{S_1, S_2, \ldots\}$ of polynomially many sets
drawn i.i.d. from some fixed, but unknown, distribution $D$ over sets in $2^{[n]}$. There is also a fixed but unknown
function $f^* : 2^{[n]} \to \mathbb{R}_+$, and the algorithm is given the value of $f^*$ at each set in $\mathcal{S}$. The goal is to design
a polynomial-time algorithm that outputs a polynomial-time-evaluatable function $f$ such that, with large
probability over $\mathcal{S}$, the set of sets for which $f$ is a good approximation for $f^*$ has large measure with respect
to $D$. More formally,

$$\Pr_{S_1, S_2, \ldots \sim D} \Big[ \Pr_{S \sim D} \big[ f(S) \le f^*(S) \le \alpha f(S) \big] \ge 1 - \epsilon \Big] \ge 1 - \delta,$$

where $f$ is the output of the learning algorithm when given inputs $\{ (S_i, f^*(S_i)) \}_{i = 1, 2, \ldots}$. The approximation
factor $\alpha \ge 1$ allows for multiplicative error in the function values. Thus, whereas the PMAC model requires
one to approximate the value of a function on a set of large measure and with high confidence, the traditional
PAC model requires one to predict the value exactly on a set of large measure and with high confidence. The
PAC model is the special case of our model with $\alpha = 1$.
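The inner probability in the definition above can be estimated empirically for any fixed hypothesis by drawing fresh sets from the distribution. The sketch below does this for a toy setup; it is an illustration under stated assumptions, not the paper's learning algorithm. The target `f_star`, the constant hypothesis `f_hat`, the product distribution with p = 0.5, and all function names are hypothetical, and the outer probability over training samples is not modeled.

```python
import random

def pmac_success_fraction(f_hat, f_star, sample, alpha):
    """Fraction of sets S in `sample` with f_hat(S) <= f_star(S) <= alpha * f_hat(S)."""
    good = sum(1 for S in sample if f_hat(S) <= f_star(S) <= alpha * f_hat(S))
    return good / len(sample)

def draw_product_sample(n, p, m, rng):
    """Draw m random subsets of [n]; each item is included independently w.p. p."""
    return [frozenset(i for i in range(1, n + 1) if rng.random() < p)
            for _ in range(m)]

rng = random.Random(0)                  # fixed seed so the run is repeatable
n = 20
f_star = lambda S: min(len(S), 5)       # hypothetical target function
f_hat = lambda S: 2.5                   # hypothetical constant hypothesis
test_sets = draw_product_sample(n, 0.5, 2000, rng)
fraction = pmac_success_fraction(f_hat, f_star, test_sets, alpha=2.0)
print(f"estimated Pr[f <= f* <= alpha f] = {fraction:.4f}")
```

Here the constant hypothesis is within a factor 2 of the target whenever $|S| \ge 3$, which holds for almost every set drawn from this distribution, so the estimated fraction is close to 1 even though the hypothesis is far from exact.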

An alternative approach for dealing with real-valued functions in learning theory is to consider the
expected squared error of $f$, which is also called "squared loss". However, this approach does not distinguish
between the case of having low error on most of the distribution and high error on just a few points, versus
moderately high error everywhere. In comparison, the PMAC model allows for more fine-grained control
with separate parameters for the amount and extent of errors, and in addition it allows for consideration of
multiplicative error, which is often more natural in this context. We discuss this further in Section 1.3.

Within the PMAC model we prove several algorithmic and hardness results for learning submodular 
functions. 
Algorithm for product distributions Our first learning result concerns product distributions. This is a
first natural step when studying learnability of various classes of functions, particularly when the class of
functions has high complexity [49, 50, 64, 77]. By making use of our new concentration result for monotone,
submodular functions under product distributions, we show that if the underlying distribution is a product
distribution, then sufficiently "smooth" (formally, 1-Lipschitz) submodular functions can be PMAC-learned
with a constant approximation factor $\alpha$ by a very simple algorithm.

Inapproximability for general distributions Although 1-Lipschitz submodular functions can be PMAC- 
learned with constant approximation factor under product distributions, this result does not generalize to 
arbitrary distributions. By making use of our new matroid construction, we show that every algorithm for 
PMAC-learning monotone, submodular functions under arbitrary distributions must have approximation
factor $\tilde{\Omega}(n^{1/3})$, even if the functions are matroid rank functions. Moreover, this lower bound holds even if
the algorithm knows the underlying distribution and it can adaptively query the given function at points of 
its choice. 

Algorithm for general distributions Our $\tilde{\Omega}(n^{1/3})$ inapproximability result for general distributions turns
out to be close to optimal. We give an algorithm to PMAC-learn an arbitrary non-negative, monotone,
submodular function with approximation factor $O(\sqrt{n})$.

This algorithm is based on a recent structural result which shows that any monotone, non-negative,
submodular function can be approximated within a factor of $\sqrt{n}$ on every point by the square root of a linear
function [32]. We leverage this result to reduce the problem of PMAC-learning a submodular function
to learning a linear separator in the usual PAC model. We remark that an improved structural result for 
any subclass of submodular functions would yield an improved analysis of our algorithm for that subclass. 
Moreover, the algorithmic approach we provide is quite robust and can be extended to handle more general 
scenarios, including forms of noise. 

The PMAC model Although this paper focuses only on learning submodular functions, the PMAC model 
that we introduce is interesting in its own right, and can be used to study the learnability of other real-valued
functions. Subsequent work by Badanidiyuru et al. Oil and Balcan et al. Q has used this model for studying 
the learnability of other classes of real-valued set functions that are widely used in algorithmic game theory. 
See Section 1.3 for further discussion.

1.1.3 Other Hardness Implications of Our Matroid Construction 

Algorithmic Game Theory and Economics An important consequence of our matroid construction is that 
matroid rank functions do not have a "sketch", i.e., a concise, approximate representation. Formally, there 
exist matroid rank functions on $2^{[n]}$ that do not have any poly(n)-space representation which approximates
every value of the function to within a $\tilde{\Omega}(n^{1/3})$ factor.

In fact, as matroid rank functions are known to satisfy the gross substitutes property [70], our work
implies that gross substitutes do not have a concise, approximate representation, or, in game theoretic terms,
gross substitutes do not have a bidding language. This provides a surprising answer to an open question in
economics [9], [10, Section 6.2.1].

Implications for submodular optimization Many optimization problems involving submodular functions, 
such as optimization over a submodular base polytope, submodular function minimization, and submod- 
ular flow, are very well behaved and their optimal solutions have a rich structure. We consider several 
other submodular optimization problems which have been considered recently in the literature, specifically 
submodular function minimization under a cardinality constraint, submodular s-t min cut and submodular 
vertex cover. These are difficult optimization problems, in the sense that the optimum value is hard to com- 
pute. We show that they are also difficult in the sense that their optimal solutions are very unstructured: the 
optimal solutions do not have a succinct representation, or even a succinct, approximate representation. 
Formally, the problem of submodular function minimization under a cardinality constraint is

$$\min \{ f(A) : A \subseteq [n],\ |A| \ge d \}$$

where $f$ is a monotone, submodular function. We show that there is no representation in poly(n) bits
for the minimizers of this problem, even allowing a factor $o(n^{1/3} / \log n)$ multiplicative error. In contrast, a
much simpler construction [33, 79, 32] shows that no deterministic algorithm performing poly(n) queries
to $f$ can approximate the minimum value to within a factor $o(n^{1/2} / \log n)$, but that construction implies
nothing about small-space representations of the minimizers.
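For intuition about the problem statement, the constrained minimum can be computed by exhaustive search on tiny instances. The sketch below is a brute-force illustration only (exponential in n) and says nothing about the hardness results; the coverage data `COVERS` and the function names are hypothetical.

```python
from itertools import combinations

def min_with_cardinality_constraint(f, n, d):
    """Brute-force min{ f(A) : A subset of [n], |A| >= d }.
    Returns (minimum value, one minimizer); exponential in n."""
    best_value, best_set = None, None
    for size in range(d, n + 1):
        for combo in combinations(range(1, n + 1), size):
            A = frozenset(combo)
            value = f(A)
            if best_value is None or value < best_value:
                best_value, best_set = value, A
    return best_value, best_set

# Hypothetical monotone submodular function: a small coverage instance.
COVERS = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}, 4: {"a", "d"}}

def coverage(A):
    """Number of points covered by the items in A."""
    covered = set()
    for i in A:
        covered |= COVERS[i]
    return len(covered)

value, minimizer = min_with_cardinality_constraint(coverage, 4, 2)
print(value, sorted(minimizer))
```

Because $f$ is monotone, some minimizer always has exactly d elements; the search over larger sets is kept only so that the code matches the constraint $|A| \ge d$ literally.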

For the submodular s-t min cut problem, which is a generalization of the classic s-t min cut problem
in network flow theory, we show that there is no representation in poly(n) bits for the minimizers, even
allowing a factor $o(n^{1/3} / \log n)$ multiplicative error. Similarly, for the submodular vertex cover problem,
which is a generalization of the classic vertex cover problem, we show that there is no representation in
poly(n) bits for the minimizers, even allowing a factor 4/3 multiplicative error.

1.2 Applications 

Algorithms for learning submodular functions could be very useful in some of the applications where these 
functions arise. For example, in the context of economics, our work provides useful tools for learning the 
valuation functions of (typical) customers, with applications such as bundle pricing, predicting demand, 
advertisement, etc. Our algorithms are also useful in settings where one would like to predict the value of 
some function over objects described by features, where the features have positive but decreasing marginal 
impact on the function's value. Examples include predicting the rate of growth of jobs in cities as a function 
of various amenities or enticements that the city offers, predicting the sales price of a house as a function of 
features (such as an updated kitchen, extra bedrooms, etc.) that it might have, and predicting the demand for 
a new laptop as a function of various add-ons which might be included. In all of these settings (and many 
others) it is natural to assume diminishing returns, making them well-suited to a formulation as a problem 
of learning a submodular function. 

1.3 Related Work 



This section focuses primarily on prior work. Section 8.1 discusses subsequent work that was directly 
motivated by this paper. 

Submodular Optimization Optimization problems involving submodular functions have long played a 
central role in combinatorial optimization. Recently there have been many applications of these optimization 
problems in machine learning, algorithmic game theory and social networks. 

The past decade has seen significant progress in algorithms for solving submodular optimization
problems. There have been improvements in both the conceptual understanding and the running time of
algorithms for submodular function minimization [43, 45, 75]. There has also been much progress on
approximation algorithms for various problems. For example, there are now optimal approximation algorithms for
submodular maximization subject to a matroid constraint [13, 17, 85], nearly-optimal algorithms for
non-monotone submodular maximization [24, 25, 73], and algorithms for submodular maximization subject to a
wide variety of constraints [5, 15, 25, 50, 51, 52, 75, 55].

Approximation algorithms for submodular analogues of several other optimization problems have been
studied, including load balancing [79], set cover PBl [88], shortest path [31], sparsest cut [79], s-t min
cut B71, vertex cover iPTl [44], etc. In this paper we provide several new results on the difficulty of such
problems. Most of these previous papers on submodular optimization prove inapproximability results using
matroids whose rank function has the same form as Eq. (1.1), but only for the drastically simpler case of
$k = 1$. Our construction is much more intricate since we must handle the case $k = n^{\omega(1)}$.
Recent work of Dobzinski and Vondrak ETI proves inapproximability of welfare maximization in
combinatorial auctions with submodular valuations. Their proof is based on a collection of submodular functions
that take high values on every set in a certain exponential-sized family, and low values on sets that are far 
from that family. This proof is in the same spirit as our inapproximability result, although their construction 
is technically very different than ours. In particular, our result uses a special family of submodular functions 
and family of sets for which the sets are local minima of the functions, whereas their result uses a different 
family of submodular functions and family of sets for which the sets are local maxima of the functions. 

Learning real-valued functions and the PMAC Model In the machine learning literature BT1I831, learning
real-valued functions (in the distributional learning setting) is often addressed by considering loss
functions such as the $L_2$-loss (i.e., $\mathbb{E}_x[(f(x) - f^*(x))^2]$) or the $L_1$-loss (i.e., $\mathbb{E}_x[|f(x) - f^*(x)|]$). However,
these do not distinguish between the case of having low error on most of the distribution and high error
on just a few points, versus moderately high error everywhere. Thus, a lower bound for the $L_2$-loss or the
$L_1$-loss is not so meaningful. In comparison, the PMAC model allows for more fine-grained control with
separate parameters for the amount and extent of errors. We note that the construction showing the $o(n^{1/3})$
inapproximability in the PMAC model immediately implies a $\tilde{\Omega}(n^{1/3})$ lower bound for the $L_1$-loss and a
$\tilde{\Omega}(n^{2/3})$ lower bound for the $L_2$-loss.³
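The distinction drawn above between squared loss and the PMAC criterion can be seen in a two-line experiment. In the sketch below, two hypothetical predictors have identical $L_2$-loss against the same target, yet one is exact on 99% of the points and the other is slightly wrong everywhere. The target, the predictors, the uniform distribution over 100 points and the value $\alpha = 1.05$ are all assumptions chosen for illustration.

```python
def squared_loss(f_hat, f_star, points):
    """Average of (f_hat(x) - f_star(x))^2 over a uniform distribution on `points`."""
    return sum((f_hat(x) - f_star(x)) ** 2 for x in points) / len(points)

def pmac_fraction(f_hat, f_star, points, alpha):
    """Fraction of points with f_hat(x) <= f_star(x) <= alpha * f_hat(x)."""
    good = sum(1 for x in points if f_hat(x) <= f_star(x) <= alpha * f_hat(x))
    return good / len(points)

points = list(range(100))
f_star = lambda x: 10                         # hypothetical target
spiky = lambda x: 0 if x == 0 else 10         # exact except at one point
uniformly_off = lambda x: 9                   # off by 1 everywhere

for name, f_hat in [("spiky", spiky), ("uniformly_off", uniformly_off)]:
    print(name,
          squared_loss(f_hat, f_star, points),
          pmac_fraction(f_hat, f_star, points, alpha=1.05))
```

Both predictors have squared loss 1, but with $\alpha = 1.05$ the first satisfies the PMAC condition on 99% of the points and the second on none of them, which is exactly the kind of difference the separate parameters $\alpha$ and $\epsilon$ are designed to expose.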

Learning Submodular Functions To our knowledge, there is no prior work on learning submodular func- 
tions in a distributional, PAC-style learning setting. The most relevant work is a paper of Goemans et al. [32], 
which considers the problem of "approximating submodular functions everywhere". That paper considers 
the algorithmic problem of efficiently finding a function which approximates a submodular function at every 
set in its domain. They give an algorithm which achieves an approximation factor $O(\sqrt{n})$, and they also
show $\tilde{\Omega}(\sqrt{n})$ inapproximability. Their algorithm adaptively queries the given function on sets of its choice,
and their output function must approximate the given function on every set.⁴ In contrast, our PMAC model
falls into the more widely studied passive, supervised learning setting [4, 54, 82, 83], which is more relevant
for our motivating application to bundle pricing. 

Our algorithm for PMAC-learning under general distributions and the Goemans et al. algorithm both rely 
on the structural result (due to Goemans et al.) that monotone, submodular functions can be approximated 
by the square root of a linear function to within a factor $\sqrt{n}$. In both cases, the challenge is to find this linear
function. The Goemans et al. algorithm is very sophisticated: it gives an intricate combinatorial algorithm 
to approximately solve a certain convex program which produces the desired function. Their algorithm 
requires query access to the function and so it is not applicable in the PMAC model. Our algorithm, on 
the other hand, is very simple: given the structural result, we can reduce our problem to that of learning a 
linear separator, which is easily solved by linear programming. Moreover, our algorithm is noise-tolerant
and more amenable to extensions; we elaborate on this in Section 4.4.

3 When talking about the $L_1$ loss or $L_2$ loss one typically normalizes the function. Since the functions in our lower bound are
matroid rank functions with the codomain {0, 1, . . . , n} there is no need to normalize.

4 Technically speaking, their model can be viewed as "approximate learning everywhere with value queries", which is not very
natural from a machine learning perspective. In particular, in many learning applications arbitrary membership or value queries are
undesirable because natural oracles, such as hired humans, have difficulty labeling synthetic examples [8]. Also, negative results for
approximate learning everywhere do not necessarily imply hardness for learning in more widely used learning models. We discuss
this in more detail below.

On the other hand, our lower bound is significantly more involved than the lower bound of Goemans
et al. [32] and the related lower bounds of Svitkina and Fleischer [79]. Essentially, the previous results
show only worst-case inapproximability, whereas we need to show average-case inapproximability.
A similar situation occurs with Boolean functions, where lower bounds for distributional learning are
typically much harder to show than lower bounds for exact learning (i.e., learning everywhere). For instance,
even conjunctions are hard to learn in the exact learning model (from random examples or via membership
queries), and yet they are trivial to PAC-learn. Proving a lower bound for PAC-learning requires exhibiting
some fundamental complexity in the class of functions. It is precisely this phenomenon which makes our
lower bound challenging to prove.

Learning Valuation Functions and other Economic Solution Concepts As discussed in Section 1.2, one
important application of our results on learning is for learning valuation functions. G. Kalai [52] considered
the problem of learning rational choice functions from random examples. Here, the learning algorithm
observes sets $S \subseteq [n]$ drawn from some distribution $D$, along with a choice $c(S) \in [n]$ for each $S$. The
goal is then to learn a good approximation to $c$ under various natural assumptions on $c$. For the assumptions
considered in [52], the choice function $c$ has a simple description as a linear ordering. In contrast, in our
work we consider valuation functions that may be much more complex and for which the PAC model would
not be sufficient to capture the inherent easiness or difficulty of the problem. Kalai briefly considers utility
functions over bundles and remarks that "the PAC-learnability of preference relations and choice functions
on commodity bundles ... deserves further study" [52].

1.4 Structure of the paper 

We begin with background about matroids and submodular functions in Section 2. In Section 3 we present
our new structural results: a new extremal family of matroids and new concentration results for submodular
functions. We present our new framework for learning real-valued functions as well as our results for
learning submodular functions within this framework in Section 4. We further present implications of our
matroid construction in optimization and algorithmic game theory in Section 6 and Section 7.

2 Preliminaries: Submodular Functions and Matroids 

2.1 Notation 

Let [n] denote the set {1, 2, . . . , n}. This will typically be used as the ground set for the matroids and
submodular functions that we discuss. For any set $S \subseteq [n]$ and element $x \in [n]$, we let $S + x$ denote
$S \cup \{x\}$. The indicator vector of a set $S \subseteq [n]$ is $\chi(S) \in \{0, 1\}^n$, where $\chi(S)_i$ is 1 if $i$ is in $S$ and
0 otherwise. We frequently use this natural isomorphism between $\{0, 1\}^n$ and $2^{[n]}$.
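The notation above translates directly into code. The following sketch shows the correspondence between subsets of [n] and 0/1 vectors, together with the $S + x$ shorthand; the function names are illustrative choices, not notation from the paper.

```python
def indicator_vector(S, n):
    """chi(S) in {0,1}^n: entry i (1-indexed) is 1 exactly when i is in S."""
    return [1 if i in S else 0 for i in range(1, n + 1)]

def set_from_indicator(chi):
    """Inverse map: recover the subset of [n] from its indicator vector."""
    return frozenset(i for i, bit in enumerate(chi, start=1) if bit == 1)

def add_element(S, x):
    """S + x in the paper's notation, i.e. the union of S and {x}."""
    return frozenset(S) | {x}

S = frozenset({2, 5})
chi = indicator_vector(S, 6)
print(chi, sorted(set_from_indicator(chi)), sorted(add_element(S, 3)))
```

Applying `set_from_indicator` after `indicator_vector` returns the original set, which is the isomorphism between $\{0, 1\}^n$ and $2^{[n]}$ used throughout.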

2.2 Submodular Functions and Matroids 

In this section we give a brief introduction to matroids and submodular functions and discuss some standard 
facts that will be used throughout the paper. A more detailed discussion can be found in standard references 
[28, 29, 65, 74, 76]. The reader familiar with matroids and submodular functions may wish to skip to
Section 3.

Let $V = \{v_1, \ldots, v_n\}$ be a collection of vectors in some vector space $\mathbb{F}^m$. Roughly one century ago,
several researchers observed that the linearly independent subsets of $V$ satisfy some interesting combinatorial
properties. For example, if $B \subseteq V$ is a basis of $\mathbb{F}^m$ and $I \subseteq V$ is linearly independent but not a basis, then
there is always a vector $v \in B$ which is not in the span of $I$, implying that $I + v$ is also linearly independent.

These combinatorial properties are quite interesting to study in their own right, as there are a wide variety 
of objects which satisfy these properties but (at least superficially) have no connection to vector spaces. A 
matroid is defined to be any collection of elements that satisfies these same combinatorial properties, without 
referring to any underlying vector space. Formally, a pair $M = ([n], \mathcal{I})$ is called a matroid if $\mathcal{I} \subseteq 2^{[n]}$ is a
non-empty family such that

• if $J \subseteq I$ and $I \in \mathcal{I}$, then $J \in \mathcal{I}$, and

• if $I, J \in \mathcal{I}$ and $|J| < |I|$, then there exists an $i \in I \setminus J$ such that $J + i \in \mathcal{I}$.

The sets in $\mathcal{I}$ are called independent.

Let us illustrate this definition with two examples. 



Partition matroid Let $V_1 \cup \cdots \cup V_k$ be a partition of $[n]$, i.e., $V_i \cap V_j = \emptyset$ whenever $i \neq j$. Define $\mathcal{I} \subseteq 2^{[n]}$
to be the family of partial transversals of $[n]$, i.e., $I \in \mathcal{I}$ if and only if $|I \cap V_i| \le 1$ for all $i = 1, \ldots, k$.
It is easy to verify that the pair $([n], \mathcal{I})$ satisfies the definition of a matroid. This is called a partition
matroid.

This definition can be generalized slightly. Let $I \in \mathcal{I}$ if and only if $|I \cap V_i| \le b_i$ for all $i = 1, \ldots, k$,
where the $b_i$ values are arbitrary. The resulting pair $([n], \mathcal{I})$ is a (generalized) partition matroid.

Graphic matroid Let $G$ be a graph with edge set $E$. Define $\mathcal{I} \subseteq 2^{E}$ to be the collection of all acyclic sets
of edges. One can verify that the pair $(E, \mathcal{I})$ satisfies the definition of a matroid. This is called a
graphic matroid.
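The two matroid axioms can be checked mechanically on small ground sets. The following is a minimal brute-force sketch, not part of the original text; the helper names `is_matroid` and `partition_independent`, as well as the example partition, are illustrative assumptions.

```python
from itertools import combinations

def all_subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def is_matroid(ground, independent):
    """Brute-force check of the two matroid axioms (small ground sets only)."""
    family = {I for I in all_subsets(ground) if independent(I)}
    if not family:
        return False
    # Axiom 1: removing any single element keeps a set independent
    # (by induction this gives closure under taking arbitrary subsets).
    for I in family:
        if any(I - {x} not in family for x in I):
            return False
    # Axiom 2: a smaller independent set can be augmented from a larger one.
    for I in family:
        for J in family:
            if len(J) < len(I) and not any(J | {i} in family for i in I - J):
                return False
    return True

def partition_independent(parts, bounds):
    """Independence test of a (generalized) partition matroid."""
    return lambda I: all(len(I & V) <= b for V, b in zip(parts, bounds))

# Hypothetical example: [6] split into three parts.
ground = {1, 2, 3, 4, 5, 6}
parts = [frozenset({1, 2}), frozenset({3, 4, 5}), frozenset({6})]
print(is_matroid(ground, partition_independent(parts, [1, 1, 1])))
print(is_matroid(ground, partition_independent(parts, [1, 2, 1])))
```

The check enumerates all $2^n$ subsets, so it is only meant for sanity-testing tiny examples.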

One might wonder: given an arbitrary matroid $([n], \mathcal{I})$, do there necessarily exist vectors $V = \{v_1, \ldots, v_n\}$
in some vector space for which the independent subsets of $V$ correspond to $\mathcal{I}$? Although this is true for
partition matroids and graphic matroids, in general the answer is no. So matroids do not capture all properties
of vector spaces. Nevertheless, many concepts from vector spaces do generalize to matroids.

For example, given vectors $V \subseteq \mathbb{F}^m$, all maximal linearly independent subsets of $V$ have the same
cardinality, which is the dimension of the span of $V$. Similarly, given a matroid $([n], \mathcal{I})$, all maximal sets in
$\mathcal{I}$ have the same cardinality, which is called the rank of the matroid.

More generally, for any subset $V' \subseteq V$, we can define its rank to be the dimension of the span of $V'$;
equivalently, this is the maximum size of any linearly independent subset of $V'$. This notion generalizes
easily to matroids. The rank function of the matroid $M = ([n], \mathcal{I})$ is the function $\mathrm{rank}_M : 2^{[n]} \to \mathbb{N}$ defined by

$$ \mathrm{rank}_M(S) := \max\{\, |I| : I \subseteq S,\ I \in \mathcal{I} \,\}. $$

Rank functions also turn out to have numerous interesting properties, the most interesting of which is the
submodularity property. Let us now illustrate this via an example. Let $V'' \subseteq V' \subseteq V$ be collections of
vectors in some vector space. Suppose that $v \in V$ is a vector which does not lie in $\mathrm{span}(V')$. Then it is
clear that $v$ does not lie in $\mathrm{span}(V'')$ either. Consequently,

$$ \mathrm{rank}(V' + v) - \mathrm{rank}(V') = 1 \implies \mathrm{rank}(V'' + v) - \mathrm{rank}(V'') = 1. $$

The submodularity property is closely related: it states that

$$ \mathrm{rank}_M(T + x) - \mathrm{rank}_M(T) \le \mathrm{rank}_M(S + x) - \mathrm{rank}_M(S) \qquad \forall S \subseteq T \subseteq [n],\ x \in [n]. $$
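As a concrete check of the diminishing-returns inequality, the sketch below computes the rank function of a small graphic matroid (the size of a spanning forest, via union-find) and verifies the inequality by brute force. The example graph and the function names are assumptions made for illustration only.

```python
from itertools import combinations

def graphic_rank(num_vertices, edges):
    """Rank of an edge set in the graphic matroid: size of a spanning forest."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    rank = 0
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:          # the edge joins two components, so no cycle forms
            parent[ru] = rv
            rank += 1
    return rank

def has_diminishing_returns(num_vertices, all_edges):
    """Check rank(T+x) - rank(T) <= rank(S+x) - rank(S) for all S ⊆ T and all x."""
    subsets = [frozenset(c) for r in range(len(all_edges) + 1)
               for c in combinations(all_edges, r)]
    rank = {S: graphic_rank(num_vertices, S) for S in subsets}
    for T in subsets:
        for S in subsets:
            if S <= T:
                for x in all_edges:
                    if rank[T | {x}] - rank[T] > rank[S | {x}] - rank[S]:
                        return False
    return True

# Hypothetical example: the 4-cycle 0-1-2-3 plus the chord (0, 2).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(graphic_rank(4, edges))
print(has_diminishing_returns(4, edges))
```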

The following properties of real-valued set functions play an important role in this paper. The function
$f : 2^{[n]} \to \mathbb{R}$ is

• Normalized if $f(\emptyset) = 0$.

• Non-negative if $f(S) \ge 0$ for all $S$.

• Monotone (or non-decreasing) if $f(S) \le f(T)$ for all $S \subseteq T$.

• Submodular if it satisfies

$$ f(T + x) - f(T) \le f(S + x) - f(S) \qquad \forall S \subseteq T \subseteq [n],\ x \in [n]. $$ (2.1)

An equivalent definition is as follows:

$$ f(A) + f(B) \ge f(A \cup B) + f(A \cap B) \qquad \forall A, B \subseteq [n]. $$ (2.2)

• L-Lipschitz if $|f(S + x) - f(S)| \le L$ for all $S \subseteq [n]$ and $x \in [n]$.

Matroid rank functions are integer-valued, normalized, non-negative, monotone, submodular and 1-Lipschitz.
The converse is also true: any function satisfying those properties is a matroid rank function.

The most interesting of these properties is submodularity. It turns out that there are a wide variety of set 
functions which satisfy the submodularity property but do not come from matroids. Let us mention two 
examples. 
Figure 3.1: This figure aims to illustrate a function $\mathrm{rank}_{M_{\mathcal{B}}}$ that is constructed by Theorem 1. This is a
real-valued function whose domain is the lattice of subsets of $V$. The family $\mathcal{B}$ contains the sets $A_1$ and $A_2$,
both of which have size $n^{1/3}$. Whereas $\mathrm{rank}_{M_{\mathcal{B}}}(S)$ is large (close to $n^{1/3}$) for most sets $S$ of size $n^{1/3}$, we
have $\mathrm{rank}_{M_{\mathcal{B}}}(A_1) = \mathrm{rank}_{M_{\mathcal{B}}}(A_2) = 8 \log^2 n$. In order to ensure submodularity, sets near $A_1$ or $A_2$ also
have low values.



Coverage function Let $S_1, \ldots, S_n$ be subsets of a ground set $[m]$. Define the function $f : 2^{[n]} \to \mathbb{N}$ by

$$ f(I) = \Bigl| \bigcup_{i \in I} S_i \Bigr|. $$

This is called a coverage function. It is integer-valued, normalized, non-negative, monotone and
submodular, but it is not 1-Lipschitz.

Cut function Let $G = ([n], E)$ be a graph. Define the function $f : 2^{[n]} \to \mathbb{N}$ by

$$ f(U) = |\delta(U)|, $$

where $\delta(U)$ is the set of all edges that have exactly one endpoint in $U$. This is called a cut function. It
is integer-valued, normalized, non-negative and submodular, but it is not monotone or 1-Lipschitz.
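The properties listed above can be verified by enumeration on small instances. The following sketch, with an illustrative set system and graph chosen here as assumptions, checks monotonicity, submodularity (via definition (2.2)) and the Lipschitz condition for one coverage function and one cut function.

```python
from itertools import combinations

def subsets(n):
    """All subsets of {0, ..., n-1} as frozensets."""
    return [frozenset(c) for r in range(n + 1) for c in combinations(range(n), r)]

def is_monotone(f, n):
    return all(f(S) <= f(S | {x}) for S in subsets(n) for x in range(n))

def is_submodular(f, n):
    # Definition (2.2): f(A) + f(B) >= f(A ∪ B) + f(A ∩ B).
    all_sets = subsets(n)
    return all(f(A) + f(B) >= f(A | B) + f(A & B) for A in all_sets for B in all_sets)

def is_lipschitz(f, n, L=1):
    return all(abs(f(S | {x}) - f(S)) <= L for S in subsets(n) for x in range(n))

# Hypothetical coverage instance: three sets S_0, S_1, S_2 over a ground set [6].
covered = [{0, 1, 2}, {2, 3}, {3, 4, 5}]

def coverage(I):
    return len(set().union(*(covered[i] for i in I)))

# Hypothetical graph on 4 vertices for the cut function.
graph_edges = [(0, 1), (1, 2), (2, 3), (0, 2)]

def cut(U):
    # An edge is cut when exactly one endpoint lies in U.
    return sum((u in U) != (v in U) for u, v in graph_edges)

print(is_monotone(coverage, 3), is_submodular(coverage, 3), is_lipschitz(coverage, 3))
print(is_monotone(cut, 4), is_submodular(cut, 4), is_lipschitz(cut, 4))
```

On these instances the coverage function fails only the 1-Lipschitz test, while the cut function fails both monotonicity and the 1-Lipschitz test, matching the statements above.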

3 New Structural Results About Matroids and Submodular Functions 

3.1 A New Family of Extremal Matroids 

In this section we present a new family of matroids whose rank functions take wildly varying values on 
many sets. The formal statement of this result is as follows. 

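Theorem 1. For any $k = 2^{o(n^{1/3})}$, there exists a family of sets $\mathcal{A} \subseteq 2^{[n]}$ and a family of matroids
$\mathcal{M} = \{\, M_{\mathcal{B}} : \mathcal{B} \subseteq \mathcal{A} \,\}$ with the following properties.

• $|\mathcal{A}| = k$ and $|A| = n^{1/3}$ for every $A \in \mathcal{A}$.

• For every $\mathcal{B} \subseteq \mathcal{A}$ and every $A \in \mathcal{A}$, we have

$$ \mathrm{rank}_{M_{\mathcal{B}}}(A) = \begin{cases} 8 \log k & (\text{if } A \in \mathcal{B}) \\ |A| & (\text{if } A \in \mathcal{A} \setminus \mathcal{B}). \end{cases} $$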
Theorem 1 implies that there exists a super-polynomial-sized collection of subsets of $[n]$ such that, for
any labeling of those sets as HIGH or LOW, we can construct a matroid where the sets in HIGH have rank
$r_{\mathrm{high}}$ and the sets in LOW have rank $r_{\mathrm{low}}$, and the ratio $r_{\mathrm{high}}/r_{\mathrm{low}} = \tilde{\Omega}(n^{1/3})$. For example, by picking
$k = n^{\log n}$, in the matroid $M_{\mathcal{B}}$, a set $A$ has rank only $O(\log^2 n)$ if $A \in \mathcal{B}$, but has rank $n^{1/3}$ if $A \in \mathcal{A} \setminus \mathcal{B}$.
In other words, as $\mathcal{B}$ varies, the rank of a set $A \in \mathcal{A}$ varies wildly, depending on whether $A \in \mathcal{B}$ or not.

Later sections of the paper use Theorem 1 to prove various negative results. In Section 4.3 we use the
theorem to prove our inapproximability result for PMAC-learning submodular functions under arbitrary
distributions. In Section 6 we use the theorem to prove results on the difficulty of several submodular
optimization problems.

In the remainder of Section 3.1 we discuss Theorem 1 and give a detailed proof.

3.1.1 Discussion of Theorem 1 and Sketch of the Construction

We begin by discussing some set systems which give intuition on how Theorem 1 is proven. Let $\mathcal{A} =
\{A_1, \ldots, A_k\}$ be a collection of subsets of $[n]$ and consider the set system

$$ \mathcal{I} = \{\, I : |I| \le r \ \wedge\ |I \cap A_j| \le b_j \ \ \forall j \in [k] \,\}. $$

If $\mathcal{I}$ is the family of independent sets of a matroid $M$, and if $\mathrm{rank}_M(A_j) = b_j$ for each $j$, then perhaps such
a construction can be used to prove Theorem 1.

Even in the case $k = 2$, understanding $\mathcal{I}$ is quite interesting. First of all, $\mathcal{I}$ typically is not a matroid.
Consider taking $n = 5$, $r = 4$, $A_1 = \{1, 2, 3\}$, $A_2 = \{3, 4, 5\}$ and $b_1 = b_2 = 2$. Then both $\{1, 2, 4, 5\}$ and
$\{2, 3, 4\}$ are maximal sets in $\mathcal{I}$ but their cardinalities are unequal, which violates a basic matroid property.
However, one can verify that $\mathcal{I}$ is a matroid if we additionally require that $r \le b_1 + b_2 - |A_1 \cap A_2|$. In fact,
we could place a constraint on $|I \cap (A_1 \cup A_2)|$ rather than on $|I|$, obtaining

$$ \{\, I : |I \cap A_1| \le b_1 \ \wedge\ |I \cap A_2| \le b_2 \ \wedge\ |I \cap (A_1 \cup A_2)| \le b_1 + b_2 - |A_1 \cap A_2| \,\}, $$

which is the family of independent sets of a matroid. In the case that $A_1$ and $A_2$ are disjoint, the third
constraint becomes $|I \cap (A_1 \cup A_2)| \le b_1 + b_2$, which is redundant because it is implied by the first two
constraints. In the case that $A_1$ and $A_2$ are "nearly disjoint", this third constraint becomes necessary and it
incorporates an "error term" of $-|A_1 \cap A_2|$.
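The $k = 2$ example can be reproduced by enumeration. The sketch below (helper names are illustrative assumptions) builds the set system for $n = 5$, $A_1 = \{1,2,3\}$, $A_2 = \{3,4,5\}$, $b_1 = b_2 = 2$, and compares $r = 4$ with $r = b_1 + b_2 - |A_1 \cap A_2| = 3$.

```python
from itertools import combinations

A1, A2 = frozenset({1, 2, 3}), frozenset({3, 4, 5})
b1 = b2 = 2
GROUND = range(1, 6)

def family(r):
    """All I ⊆ [5] with |I| <= r, |I ∩ A1| <= b1 and |I ∩ A2| <= b2."""
    sets = (frozenset(c) for size in range(r + 1) for c in combinations(GROUND, size))
    return {I for I in sets if len(I & A1) <= b1 and len(I & A2) <= b2}

def maximal_sizes(fam):
    """Cardinalities of the inclusion-maximal members of `fam`."""
    return {len(I) for I in fam if not any(I < J for J in fam)}

def has_exchange(fam):
    """Augmentation axiom: every smaller set can be extended from a larger one."""
    return all(any(J | {i} in fam for i in I - J)
               for I in fam for J in fam if len(J) < len(I))

r_safe = b1 + b2 - len(A1 & A2)   # equals 3 for this example

print(maximal_sizes(family(4)), has_exchange(family(4)))
print(maximal_sizes(family(r_safe)), has_exchange(family(r_safe)))
```

With $r = 4$ the maximal sets have sizes 3 and 4 and the exchange axiom fails; with $r = 3$ all maximal sets have size 3 and the exchange axiom holds.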

To generalize to $k > 2$, we impose similar constraints for every subcollection of $\mathcal{A}$, and we must include
additional "error terms" that are small when the $A_j$'s are nearly disjoint. Theorem 2 proves that

$$ \mathcal{I} = \{\, I : |I \cap A(J)| \le g(J) \ \ \forall J \subseteq [k] \,\} $$ (3.1)

is a matroid, where the function $g : 2^{[k]} \to \mathbb{Z}$ is defined by

$$ g(J) := \sum_{j \in J} b_j - \Bigl( \sum_{j \in J} |A_j| - |A(J)| \Bigr), \qquad \text{where } A(J) := \bigcup_{j \in J} A_j. $$ (3.2)

In the definition of $g(J)$, we should think of $-\bigl( \sum_{j \in J} |A_j| - |A(J)| \bigr)$ as an "error term", since it is
non-positive, and it captures the "overlap" of the sets $\{\, A_j : j \in J \,\}$. In particular, in the case $J = \{1, 2\}$, this
error term is $-|A_1 \cap A_2|$, as it was in our discussion of the case $k = 2$.
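To make the definitions (3.1) and (3.2) concrete, here is a small sketch that evaluates $g(J)$ and enumerates the resulting family for three overlapping sets. The particular sets, the bounds $b_j = 2$ and all function names are assumptions chosen for illustration.

```python
from itertools import combinations

def union_of(sets, J):
    """A(J): the union of the sets A_j with j in J."""
    return frozenset().union(*(sets[j] for j in J))

def g(sets, b, J):
    """Eq. (3.2): sum of b_j minus the overlap (sum of |A_j| minus |A(J)|)."""
    overlap = sum(len(sets[j]) for j in J) - len(union_of(sets, J))
    return sum(b[j] for j in J) - overlap

def independent_sets(n, sets, b):
    """The family (3.1), enumerated by brute force over all I ⊆ [n]."""
    index_sets = [J for r in range(len(sets) + 1)
                  for J in combinations(range(len(sets)), r)]
    ground = range(1, n + 1)
    candidates = (frozenset(c) for r in range(n + 1) for c in combinations(ground, r))
    return {I for I in candidates
            if all(len(I & union_of(sets, J)) <= g(sets, b, J) for J in index_sets)}

def has_exchange(fam):
    """Augmentation axiom of a matroid."""
    return all(any(J | {i} in fam for i in I - J)
               for I in fam for J in fam if len(J) < len(I))

# Hypothetical instance: three sets of size 3 on [6], pairwise sharing one element.
sets = [frozenset({1, 2, 3}), frozenset({3, 4, 5}), frozenset({5, 6, 1})]
b = [2, 2, 2]

print(g(sets, b, (0, 1)), g(sets, b, (0, 1, 2)))
fam = independent_sets(6, sets, b)
print(len(fam), has_exchange(fam))
```

For this instance the pairwise error term is $-1$ and the error term for all three sets is $-3$, so $g$ equals 3 on every index set of size at least two.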

Let us now consider a special case of this construction. If the $A_j$'s are all disjoint then the error terms are
all 0, so the family $\mathcal{I}$ reduces to

$$ \{\, I : |I \cap A_j| \le b_j \ \ \forall j \in [k] \,\}, $$

which is a (generalized) partition matroid, regardless of the $b_j$ values. Unfortunately these matroids cannot
achieve our goal of having superpolynomially many sets labeled HIGH or LOW. The reason is that, since
the $A_j$'s must be disjoint, there can be at most $n$ of them.

In fact, it turns out that any matroid of the form (3.1) can have at most $n$ sets in the collection $\mathcal{A}$. To
obtain a super-polynomially large $\mathcal{A}$ we must modify this construction slightly. Theorem 3 shows that,
under certain conditions, the family

$$ \mathcal{I} = \{\, I : |I| \le d \ \wedge\ |I \cap A(J)| \le g(J) \ \ \forall J \subseteq [k],\ |J| < \tau \,\} $$

is also the family of independent sets of a matroid.

There is an important special case of this construction. Suppose that $|A_j| = d$ and $b_j = d - 1$ for every
$j$, and that $|A_i \cap A_j| \le d - 2$ for all $i \neq j$. The resulting matroid is called a paving matroid, a well-known type
of matroid. These matroids are quite relevant to our goals of having super-polynomially many sets labeled
HIGH and LOW. The reason is that the conditions on the $A_j$'s are equivalent to $\mathcal{A}$ being a constant-weight
error-correcting code of distance 4, and it is well-known that such codes can have super-polynomial size.
Unfortunately this construction has $r_{\mathrm{low}} = d - 1$ and $r_{\mathrm{high}} = d$; this small, additive gap is much too weak
for our purposes.

The high-level plan underlying Theorem 1 is to find a new class of matroids that somehow combines the
positive attributes of both partition and paving matroids. From paving matroids we will inherit the large size
of the collection $\mathcal{A}$, and from partition matroids we will inherit a large ratio $r_{\mathrm{high}}/r_{\mathrm{low}}$.

One of our key observations is that there is a commonality between partition and paving matroids: the
collection $\mathcal{A}$ must satisfy an "expansion" property, which roughly means that the $A_j$'s cannot overlap too
much. With partition matroids the $A_j$'s must be disjoint, which amounts to having "perfect" expansion. With
paving matroids the $A_j$'s must have small pairwise intersections, which is a fairly weak sort of expansion.

It turns out that the "perfect" expansion required by partition matroids is too strong for $\mathcal{A}$ to have
super-polynomial size, and the "pairwise" expansion required by paving matroids is too weak to allow a large ratio
$r_{\mathrm{high}}/r_{\mathrm{low}}$. Fortunately, weakening the expansion from "perfect" to "nearly-perfect" is enough to obtain a
collection $\mathcal{A}$ of super-polynomial size. With several additional technical ideas, we show that these
nearly-perfect expansion properties can be leveraged to achieve our desired ratio $r_{\mathrm{high}}/r_{\mathrm{low}} = \tilde{\Omega}(n^{1/3})$. These
ideas lead to a proof of Theorem 1.

3.1.2 Our New Matroid Constructions 



Our first matroid construction is given by the following theorem, which is proven in Section 3.1.3.

Theorem 2. The family $\mathcal{I}$ given in Eq. (3.1) is the family of independent sets of a matroid, if it is non-empty.

As mentioned above, Theorem 2 does not suffice to prove Theorem 1. To see why, suppose that $|\mathcal{A}| =
k > n$ and that $b_i < |A_i|$ for every $i$. Then $g([k]) \le n - k < 0$, and therefore $\mathcal{I}$ is empty. So the construction
of Theorem 2 is only applicable when $k \le n$, which is insufficient for proving Theorem 1.

We now modify the preceding construction by introducing a sort of "truncation" operation which allows
us to take $k \gg n$. We emphasize that this truncation is not ordinary matroid truncation. The ordinary
truncation operation decreases the rank of the matroid, whereas we want to increase the rank by throwing away
constraints in the definition of $\mathcal{I}$. We will introduce an additional parameter $\tau$, and only keep constraints for
$|J| < \tau$. So long as $g$ is large enough for a certain interval, then we can truncate $g$ and still get a matroid.

Definition 1. Let $d$ and $\tau$ be non-negative integers. A function $g : 2^{[k]} \to \mathbb{R}$ is called $(d, \tau)$-large if

$$ g(J) \ge 0 \quad \forall J \subseteq [k],\ |J| < \tau, \qquad\qquad g(J) \ge d \quad \forall J \subseteq [k],\ \tau \le |J| \le 2\tau - 2. $$ (3.3)

The truncated function $\hat{g} : 2^{[k]} \to \mathbb{Z}$ is defined by

$$ \hat{g}(J) := \begin{cases} g(J) & (\text{if } |J| < \tau) \\ d & (\text{otherwise}). \end{cases} $$

Theorem 3. Suppose that the function $g$ defined in Eq. (3.2) is $(d, \tau)$-large. Then the family

$$ \mathcal{I} = \{\, I : |I \cap A(J)| \le \hat{g}(J) \ \ \forall J \subseteq [k] \,\} $$

is the family of independent sets of a matroid.
Consequently, we claim that the family

$$ \mathcal{I} = \{\, I : |I| \le d \ \wedge\ |I \cap A(J)| \le g(J) \ \ \forall J \subseteq [k],\ |J| < \tau \,\} $$

is also the family of independent sets of a matroid. This claim follows immediately if the $A_j$'s cover the
ground set (i.e., $A([k]) = [n]$), because the matroid definition in Theorem 3 includes the constraint $|I| =
|I \cap A([k])| \le \hat{g}([k]) = d$. Alternatively, if $A([k]) \neq [n]$, we may apply the well-known matroid
truncation operation which constructs a new matroid simply by removing all independent sets of size greater
than $d$.

This construction yields quite a broad family of matroids. We list several interesting special cases in the
Appendix. In particular, partition matroids and paving matroids are both special cases. Thus, our construction
can produce "non-linear" matroids (i.e., matroids that do not correspond to vectors over any field), as the
Vamos matroid is a paving matroid that is non-linear [74].
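The paving special case gives a small instance on which the truncated construction can be tested. The sketch below takes the $A_j$'s to be the seven lines of the Fano plane (an illustrative choice made here, not taken from the text), with $d = 3$, $b_j = 2$ and $\tau = 2$, checks the $(d, \tau)$-large condition of Definition 1, and enumerates the family of Theorem 3.

```python
from itertools import combinations

# Lines of the Fano plane on points 1..7; any two lines meet in exactly one point.
LINES = [frozenset(s) for s in
         [{1, 2, 3}, {1, 4, 5}, {1, 6, 7}, {2, 4, 6}, {2, 5, 7}, {3, 4, 7}, {3, 5, 6}]]

def union_of(sets, J):
    return frozenset().union(*(sets[j] for j in J))

def g(sets, b, J):
    """Eq. (3.2)."""
    overlap = sum(len(sets[j]) for j in J) - len(union_of(sets, J))
    return sum(b[j] for j in J) - overlap

def is_large(sets, b, d, tau):
    """Definition 1: g >= 0 for |J| < tau and g >= d for tau <= |J| <= 2*tau - 2."""
    k = len(sets)
    for size in range(min(k, 2 * tau - 2) + 1):
        for J in combinations(range(k), size):
            if g(sets, b, J) < (0 if size < tau else d):
                return False
    return True

def truncated_family(n, sets, b, d, tau):
    """Family of Theorem 3: |I ∩ A(J)| <= g_hat(J) for every J ⊆ [k]."""
    k = len(sets)
    constraints = []
    for size in range(k + 1):
        for J in combinations(range(k), size):
            bound = g(sets, b, J) if size < tau else d   # truncated function g_hat
            constraints.append((union_of(sets, J), bound))
    ground = range(1, n + 1)
    candidates = (frozenset(c) for size in range(n + 1) for c in combinations(ground, size))
    return {I for I in candidates if all(len(I & C) <= bound for C, bound in constraints)}

def has_exchange(fam):
    return all(any(J | {i} in fam for i in I - J)
               for I in fam for J in fam if len(J) < len(I))

b = [2] * len(LINES)
print(is_large(LINES, b, d=3, tau=2))
fam = truncated_family(7, LINES, b, d=3, tau=2)
print(len(fam), has_exchange(fam))
```

Under these assumptions the independent sets are exactly the sets of at most three points that are not lines. With $\tau = 3$ the same $g$ is no longer $(3, \tau)$-large, because four lines avoiding a common point give $g(J) = 2 < d$.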

3.1.3 Proofs of Theorem 2 and Theorem 3

In this section, we will prove Theorem 2 and Theorem 3. We start with a simple but useful lemma which
describes a general set of conditions that suffice to obtain a matroid.

Let $\mathcal{C} \subseteq 2^{[n]}$ be an arbitrary family of sets and let $g : \mathcal{C} \to \mathbb{Z}$ be a function. Consider the family

$$ \mathcal{I} = \{\, I : |I \cap C| \le g(C) \ \ \forall C \in \mathcal{C} \,\}. $$ (3.4)

For any $I \in \mathcal{I}$, define $\mathcal{T}(I) = \{\, C \in \mathcal{C} : |I \cap C| = g(C) \,\}$ to be the set of constraints that are "tight" for
the set $I$. Suppose that $g$ has the following property:

$$ \forall I \in \mathcal{I},\ \ C_1, C_2 \in \mathcal{T}(I) \implies (C_1 \cup C_2 \in \mathcal{T}(I)) \ \vee\ (C_1 \cap C_2 = \emptyset). $$ (3.5)

Properties of this sort are commonly called "uncrossing" properties. Note that we do not require that
$C_1 \cap C_2 \in \mathcal{C}$. We show in the following lemma that this uncrossing property is sufficient to obtain a matroid.



Lemma 1. Assume that Eq. (3.5) holds. Then $\mathcal{I}$ is the family of independent sets of a matroid, if it is
non-empty.

Proof. We will show that $\mathcal{I}$ satisfies the required axioms of an independent set family. If $I \subseteq I' \in \mathcal{I}$
then clearly $I \in \mathcal{I}$ also. So suppose that $I \in \mathcal{I}$, $I' \in \mathcal{I}$ and $|I| < |I'|$. Let $C_1, \ldots, C_m$ be the maximal
sets in $\mathcal{T}(I)$ and let $C^* = \bigcup_i C_i$. Note that these maximal sets are disjoint, otherwise we could replace any
intersecting sets with their union. In other words, $C_i \cap C_j = \emptyset$ for $i \neq j$, otherwise Eq. (3.5) implies that
$C_i \cup C_j \in \mathcal{T}(I)$, contradicting maximality. So

$$ |I' \cap C^*| = \sum_{i=1}^{m} |I' \cap C_i| \le \sum_{i=1}^{m} g(C_i) = \sum_{i=1}^{m} |I \cap C_i| = |I \cap C^*|. $$

Since $|I'| > |I|$ but $|I' \cap C^*| \le |I \cap C^*|$, we must have that $|I' \setminus C^*| > |I \setminus C^*|$. The key consequence
is that some element $x \in I' \setminus I$ is not contained in any tight set, i.e., there exists $x \in I' \setminus (C^* \cup I)$. Then
$I + x \in \mathcal{I}$ because for every $C \ni x$ we have $|I \cap C| \le g(C) - 1$. ■

We now use Lemma 1 to prove Theorem 2, restated here.

Theorem 2. The family $\mathcal{I}$ defined in Eq. (3.1), namely

$$ \mathcal{I} = \{\, I : |I \cap A(J)| \le g(J) \ \ \forall J \subseteq [k] \,\}, $$

where

$$ g(J) := \sum_{j \in J} b_j - \Bigl( \sum_{j \in J} |A_j| - |A(J)| \Bigr) \qquad \text{and} \qquad A(J) := \bigcup_{j \in J} A_j, $$

is the family of independent sets of a matroid, if it is non-empty.

5 There are general matroid constructions in the literature which are similar in spirit to Lemma 1, e.g., the construction of
Edmonds [22, Theorem 15] and the construction of Frank and Tardos [76, Corollary 49.7a]. However, we were unable to use those
existing constructions to prove Theorem 2 or Theorem 3.

This theorem is proven by showing that the constraints defining $\mathcal{I}$ can be "uncrossed" (in the sense that
they satisfy (3.5)), then applying Lemma 1. It is not a priori obvious that these constraints can be uncrossed:
in typical uses of uncrossing, the right-hand side $g(J)$ should be a submodular function of $J$ and the
left-hand side $|I \cap A(J)|$ should be a supermodular function of $J$. In our case both $g(J)$ and $|I \cap A(J)|$ are
submodular.

Proof (of Theorem 2). The proof applies Lemma 1 to the family $\mathcal{C} = \{\, A(J) : J \subseteq [k] \,\}$. We must also
define a function $g' : \mathcal{C} \to \mathbb{Z}$. However there is a small issue: it is possible that there exist $J \neq J'$ with
$A(J) = A(J')$ but $g(J) \neq g(J')$, so we cannot simply define $g'(A(J)) = g(J)$. Instead, we define the
value of $g'(A(J))$ according to the tightest constraint on $|I \cap A(J)|$, i.e.,

$$ g'(C) := \min\{\, g(J) : A(J) = C \,\} \qquad \forall C \in \mathcal{C}. $$

Now fix $I \in \mathcal{I}$ and suppose that $C_1$ and $C_2$ are tight, i.e., $|I \cap C_i| = g'(C_i)$. Define $h_I : 2^{[k]} \to \mathbb{Z}$ by

$$ h_I(J) := g(J) - |I \cap A(J)| = |A(J) \setminus I| - \sum_{j \in J} (|A_j| - b_j). $$

We claim that $h_I$ is a submodular function of $J$. This follows because $J \mapsto |A(J) \setminus I|$ is a submodular
function of $J$ (cf. Theorem 24 in Appendix A.1), and $J \mapsto \sum_{j \in J} (|A_j| - b_j)$ is a modular function of $J$.
Now choose $J_i$ satisfying $A(J_i) = C_i$ and $g'(C_i) = g(J_i)$, for both $i \in \{1, 2\}$. Then

$$ h_I(J_i) = g(J_i) - |I \cap A(J_i)| = g'(C_i) - |I \cap C_i| = 0, $$

for both $i \in \{1, 2\}$. However $h_I \ge 0$, since we assume $I \in \mathcal{I}$ and therefore $|I \cap A(J)| \le g(J)$ for all $J$.
So we have shown that $J_1$ and $J_2$ are both minimizers of $h_I$. It is well-known that the minimizers of any
submodular function are closed under union and intersection. (See Lemma 7 in Appendix A.1.) So $J_1 \cup J_2$
and $J_1 \cap J_2$ are also minimizers, implying that $A(J_1 \cup J_2) = A(J_1) \cup A(J_2) = C_1 \cup C_2$ is also tight.
This shows that Eq. (3.5) holds, so the theorem follows from Lemma 1. ■

A similar approach is used for our second construction.

Proof (of Theorem 3). Fix $I \in \mathcal{I}$. Let $J_1$ and $J_2$ satisfy $|I \cap A(J_i)| = \hat{g}(J_i)$. By considering two cases, we
will show that $|I \cap A(J_1 \cup J_2)| \ge \hat{g}(J_1 \cup J_2)$, so the desired result follows from Lemma 1.

Case 1: $\max\{|J_1|, |J_2|\} \ge \tau$. Without loss of generality, $|J_1| \ge |J_2|$. Then

$$ \hat{g}(J_1 \cup J_2) = d = \hat{g}(J_1) = |I \cap A(J_1)| \le |I \cap A(J_1 \cup J_2)|. $$

Case 2: $\max\{|J_1|, |J_2|\} \le \tau - 1$. So $|J_1 \cup J_2| \le 2\tau - 2$. We have $|I \cap A(J_i)| = \hat{g}(J_i) = g(J_i)$ for both $i$.
As argued in Theorem 2, we also have $|I \cap A(J_1 \cup J_2)| = g(J_1 \cup J_2)$. But $g(J_1 \cup J_2) \ge \hat{g}(J_1 \cup J_2)$ since
$g$ is $(d, \tau)$-large. ■

3.1.4 Putting it all together: Proof of Theorem 1

In this section we use the construction in Theorem 3 to prove Theorem 1, which is restated here.

Theorem 1. For any $k = 2^{o(n^{1/3})}$, there exists a family of sets $\mathcal{A} \subseteq 2^{[n]}$ and a family of matroids
$\mathcal{M} = \{\, M_{\mathcal{B}} : \mathcal{B} \subseteq \mathcal{A} \,\}$ with the following properties.

• $|\mathcal{A}| = k$ and $|A| = n^{1/3}$ for every $A \in \mathcal{A}$.

• For every $\mathcal{B} \subseteq \mathcal{A}$ and every $A \in \mathcal{A}$, we have

$$ \mathrm{rank}_{M_{\mathcal{B}}}(A) = \begin{cases} 8 \log k & (\text{if } A \in \mathcal{B}) \\ |A| & (\text{if } A \in \mathcal{A} \setminus \mathcal{B}). \end{cases} $$

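To prove this theorem, we must construct a family of sets $\mathcal{A} = \{A_1, \ldots, A_k\}$ where each $|A_i| = n^{1/3}$,
and for every $\mathcal{B} \subseteq \mathcal{A}$ we must construct a matroid $M_{\mathcal{B}}$ with the desired properties. It will be convenient
to let $d = n^{1/3}$ denote the size of the $A_i$'s, to let the index set of $\mathcal{A}$ be denoted by $U := [k]$, and to let the
index set for $\mathcal{B}$ be denoted by $U_{\mathcal{B}} := \{\, i \in U : A_i \in \mathcal{B} \,\}$. Each matroid $M_{\mathcal{B}}$ is constructed by applying
Theorem 3 with the set family $\mathcal{B}$ instead of $\mathcal{A}$, so its independent sets are

$$ \mathcal{I}_{\mathcal{B}} := \{\, I : |I| \le d \ \wedge\ |I \cap A(J)| \le g_{\mathcal{B}}(J) \ \ \forall J \subseteq U_{\mathcal{B}},\ |J| < \tau \,\}, $$

where the function $g_{\mathcal{B}} : 2^{U_{\mathcal{B}}} \to \mathbb{R}$ is defined as in Eq. (3.2), taking all $b_j$'s to be equal to a common value $b$:

$$ g_{\mathcal{B}}(J) := \sum_{j \in J} b - \Bigl( \sum_{j \in J} |A_j| - |A(J)| \Bigr) = (b - d)|J| + |A(J)| \qquad \forall J \subseteq U_{\mathcal{B}}. $$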



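Several steps remain. We must choose the set family $\mathcal{A}$, then choose parameters carefully such that, for
every $\mathcal{B} \subseteq \mathcal{A}$, we have

• P1: $M_{\mathcal{B}}$ is indeed a matroid,

• P2: $\mathrm{rank}_{M_{\mathcal{B}}}(A_i) = 8 \log k$ for all $A_i \in \mathcal{B}$, and

• P3: $\mathrm{rank}_{M_{\mathcal{B}}}(A_i) = |A_i|$ for all $A_i \in \mathcal{A} \setminus \mathcal{B}$.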

Let us start with P2. Suppose $A_i \in \mathcal{B}$. The definition of $\mathcal{I}_{\mathcal{B}}$ includes the constraint $|I \cap A_i| \le g_{\mathcal{B}}(\{i\})$,
which implies that $\mathrm{rank}_{M_{\mathcal{B}}}(A_i) \le g_{\mathcal{B}}(\{i\}) = b$. This suggests that choosing $b := 8 \log k$ may be a good
choice to satisfy P2.

On the other hand, if $A_i \notin \mathcal{B}$ then P3 requires that $A_i$ is independent in $M_{\mathcal{B}}$. To achieve this, we need the
constraints $|I \cap A(J)| \le g_{\mathcal{B}}(J)$ to be as loose as possible, i.e., $g_{\mathcal{B}}(J)$ should be as large as possible. Notice
that $g_{\mathcal{B}}(J)$ has two terms, $\sum_{j \in J} b_j$ which grows as a function of $J$, and $-\bigl( \sum_{j \in J} |A_j| - |A(J)| \bigr)$, which is
non-positive. So we desire that $|A(J)|$ should be as close as possible to $\sum_{j \in J} |A_j|$, for all $J$ with $|J| < \tau$.

Set systems with this property are equivalent to expander graphs.

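Definition 2. Let $G = (U \cup V, E)$ be a bipartite graph. For $J \subseteq U$, define

$$ \Gamma(J) := \{\, v : \exists u \in J \text{ such that } \{u, v\} \in E \,\}. $$

The graph $G$ is called a $(d, L, \epsilon)$-expander if

$$ |\Gamma(\{u\})| = d \quad \forall u \in U, \qquad\qquad |\Gamma(J)| \ge (1 - \epsilon) \cdot d \cdot |J| \quad \forall J \subseteq U,\ |J| \le L. $$

Additionally, $G$ is called a lossless expander if $\epsilon < 1/2$.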
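For small graphs, the two conditions of Definition 2 can be tested directly by enumerating all $J$ with $|J| \le L$. The sketch below is illustrative only: the adjacency-dictionary representation and the function names are assumptions, and the enumeration is exponential in $L$.

```python
from itertools import combinations

def neighborhood(adj, J):
    """Γ(J): all right-hand vertices adjacent to some vertex of J."""
    return set().union(*(adj[u] for u in J))

def is_expander(adj, d, L, eps):
    """Brute-force test of the (d, L, eps)-expander conditions.

    `adj` maps each left vertex u to the set Γ({u}) of its neighbours.
    """
    # Every left vertex must have exactly d distinct neighbours.
    if any(len(nbrs) != d for nbrs in adj.values()):
        return False
    left = list(adj)
    # Every J with 1 <= |J| <= L must expand by a factor of at least (1 - eps) * d.
    for size in range(1, min(L, len(left)) + 1):
        for J in combinations(left, size):
            if len(neighborhood(adj, J)) < (1 - eps) * d * size:
                return False
    return True

# Disjoint neighbourhoods expand perfectly (eps = 0), as with partition matroids.
disjoint = {0: {0, 1}, 1: {2, 3}, 2: {4, 5}}
# Two neighbourhoods sharing one vertex lose one of four possible neighbours.
overlapping = {0: {0, 1}, 1: {1, 2}}

print(is_expander(disjoint, d=2, L=3, eps=0.0))
print(is_expander(overlapping, d=2, L=2, eps=0.25))
print(is_expander(overlapping, d=2, L=2, eps=0.2))
```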

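Given such a graph $G$, we construct the set family $\mathcal{A} = \{A_1, \ldots, A_k\} \subseteq 2^{[n]}$ by identifying $U = [k]$,
$V = [n]$, and for each vertex $i \in U$ defining $A_i := \Gamma(\{i\})$. The resulting sets satisfy:

$$ |A_i| = d \quad \forall i \in U $$
$$ |A(J)| \ge (1 - \epsilon) \cdot d \cdot |J| \quad \forall J \subseteq U,\ |J| \le L $$ (3.6)
$$ \implies \sum_{j \in J} |A_j| - |A(J)| \le \epsilon \cdot d \cdot |J| \quad \forall J \subseteq U,\ |J| \le L. $$

This last inequality will allow us to show that $g_{\mathcal{B}}(J)$ is sufficiently large.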

To make things concrete, let us now state the expander construction that we will use. Lossless expanders
are well-studied [38, 42], and several probabilistic constructions are known, both in folklore and in the
literature [12, Lemma 3.10], [42, §1.2], [78, Theorem 26], ED Theorem 4.4]. The following construction
of Buhrman et al. [12, Lemma 3.10] has parameters that match our requirements.
Theorem 4. Suppose $k \ge 8$, $n \ge 25 L \log(k)/\epsilon^2$, and $d \ge \log(k)/2\epsilon$. Then there exists a graph $G =
(U \cup V, E)$ with $|U| = k$ and $|V| = n$ that is a $(d, L, \epsilon)$-lossless expander.

For the sake of completeness, we state and prove a different probabilistic construction that also matches
our requirements. The proof is in the Appendix.

Theorem 5. Let $G = (U \cup V, E)$ be a random multigraph where $|U| = k$, $|V| = n$, and every $u \in U$ has
exactly $d$ incident edges, each of which has an endpoint chosen uniformly and independently from all nodes
in $V$. Suppose that $k \ge 4$, $d \ge \log(k)/\epsilon$ and $n \ge 16 L d/\epsilon$. Then

$$ \Pr[\, G \text{ is a } (d, L, \epsilon)\text{-lossless expander and has no parallel edges} \,] \ge 1 - 2/k. $$
We require an expander with the following parameters. Recall that $n$ is arbitrary and $k = 2^{o(n^{1/3})}$.

$$ d := n^{1/3}, \qquad L := \frac{n^{1/3}}{2 \log k}, \qquad \epsilon := \frac{2 \log k}{n^{1/3}}. $$

These satisfy the hypotheses of Theorem 4 (and Theorem 5), so a $(d, L, \epsilon)$-expander exists, and a set family
$\mathcal{A}$ satisfying Eq. (3.6) exists. Next we use these properties of $\mathcal{A}$ to show that P1, P2 and P3 hold.
The fact that P1 holds follows from Theorem 3 and the following claim. Recall that $b = 8 \log k$.

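Claim 1. Set $\tau = n^{1/3}/(4 \log k)$. Then $g_{\mathcal{B}}$ is $(d, \tau)$-large, as defined in (3.3).

Proof. Consider any $J \subseteq U_{\mathcal{B}}$ with $|J| \le 2\tau - 2$. Then

$$ g_{\mathcal{B}}(J) = (b - d)|J| + |A(J)| $$
$$ \ge b|J| - \epsilon d |J| \qquad (\text{by Eq. (3.6), since } |J| \le 2\tau - 2 \le L) $$
$$ = \frac{3b}{4} |J| \qquad (\text{since } \epsilon = b/4d). $$ (3.7)

This shows $g_{\mathcal{B}}(J) \ge 0$. If additionally $|J| \ge \tau$ then $g_{\mathcal{B}}(J) \ge (3/4) b \tau \ge d$. ■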
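The arithmetic behind these parameter choices can be checked numerically. The sketch below plugs in one illustrative pair of values, $d = n^{1/3} = 100$ and $\log k = 10$; it assumes base-2 logarithms, which the text does not specify, and the function names are hypothetical.

```python
import math

def expander_parameters(d, log_k):
    """Parameters from the proof of Theorem 1, given d = n^(1/3) and log k."""
    return {
        "n": d ** 3,
        "b": 8 * log_k,
        "L": d / (2 * log_k),
        "eps": 2 * log_k / d,
        "tau": d / (4 * log_k),
    }

def check_claim_1(d, log_k):
    """The three numeric facts used in the proof of Claim 1."""
    p = expander_parameters(d, log_k)
    return (
        math.isclose(p["eps"], p["b"] / (4 * d))   # eps = b / 4d
        and 2 * p["tau"] - 2 <= p["L"]             # so Eq. (3.6) applies to J
        and 0.75 * p["b"] * p["tau"] >= d          # g_B(J) >= d once |J| >= tau
    )

def check_theorem_4_hypotheses(d, log_k):
    """Hypotheses of Theorem 4; k = 2**log_k assumes base-2 logarithms."""
    p = expander_parameters(d, log_k)
    k = 2 ** log_k
    return (
        k >= 8
        and p["n"] >= 25 * p["L"] * log_k / p["eps"] ** 2
        and d >= log_k / (2 * p["eps"])
    )

params = expander_parameters(100, 10)
print(params)
print(check_claim_1(100, 10), check_theorem_4_hypotheses(100, 10))
```

Note that $\tau$ need not be an integer for such values; the sketch only checks the inequalities, not integrality.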

The following claim implies that P2 holds.

Claim 2. For all $\mathcal{B} \subseteq \mathcal{A}$ and all $A_i \in \mathcal{B}$ we have $\mathrm{rank}_{M_{\mathcal{B}}}(A_i) = b$.

Proof. The definition of $\mathcal{I}_{\mathcal{B}}$ includes the constraint $|I \cap A_i| \le g_{\mathcal{B}}(\{i\}) = b$. This immediately implies
$\mathrm{rank}_{M_{\mathcal{B}}}(A_i) \le b$. To prove that equality holds, it suffices to prove that $g_{\mathcal{B}}(J) \ge b$ whenever $|J| \ge 1$,
since this implies that every constraint in the definition of $\mathcal{I}_{\mathcal{B}}$ has right-hand side at least $b$ (except for the
constraint corresponding to $J = \emptyset$, which is vacuous). For $|J| = 1$ this is immediate, and for $|J| \ge 2$ we
use (3.7) to obtain $g_{\mathcal{B}}(J) \ge 3b|J|/4 \ge b$. ■

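Finally, the following claim implies that P3 holds.

Claim 3. For all $\mathcal{B} \subseteq \mathcal{A}$ and all $A_i \in \mathcal{A} \setminus \mathcal{B}$ we have $\mathrm{rank}_{M_{\mathcal{B}}}(A_i) = d$.

Proof. Since $d = |A_i|$, the condition $\mathrm{rank}_{M_{\mathcal{B}}}(A_i) = d$ holds iff $A_i \in \mathcal{I}_{\mathcal{B}}$. So it suffices to prove that $A_i$
satisfies all constraints in the definition of $\mathcal{I}_{\mathcal{B}}$.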

The constraint $|A_i| \le d$ is trivially satisfied, by Eq. (3.6). So it remains to show that for every $J \subseteq U_{\mathcal{B}}$
with $|J| < \tau$, we have

$$ |A_i \cap A(J)| \le g_{\mathcal{B}}(J). $$ (3.8)

This is trivial if $J = \emptyset$, so assume $|J| \ge 1$. We have

$$ |A_i \cap A(J)| = |A_i| + |A(J)| - |A(J + i)| $$
$$ \le d + d|J| - (1 - \epsilon) d |J + i| \qquad (\text{by Eq. (3.6)}) $$
$$ = \frac{b |J + i|}{4} \le \frac{b |J|}{2} \qquad (\text{since } \epsilon = b/4d) $$
$$ \le g_{\mathcal{B}}(J) \qquad (\text{by Eq. (3.7)}). $$

This proves Eq. (3.8), so $A_i \in \mathcal{I}_{\mathcal{B}}$, as desired. ■



3.2 Concentration Properties of Submodular Functions 

In this section we provide a new strong concentration bound for submodular functions.

Theorem 6. Let $f : 2^{[n]} \to \mathbb{R}_+$ be a non-negative, monotone, submodular, 1-Lipschitz function. Let the
random variable $X \subseteq [n]$ have a product distribution. For any $b, t \ge 0$,

$$ \Pr[\, f(X) \le b - t\sqrt{b} \,] \cdot \Pr[\, f(X) \ge b \,] \le \exp(-t^2/4). $$ (3.9)

To understand Theorem 6, it is instructive to compare it with known results. For example, the Chernoff
bound is precisely a concentration bound for linear, Lipschitz functions. On the other hand, if $f$ is an
arbitrary 1-Lipschitz function then McDiarmid's inequality implies concentration, although of a much weaker
form, with standard deviation roughly $\sqrt{n}$. If $f$ is additionally known to be submodular, then we can apply
Theorem 6 with $b$ equal to a median, which can be much smaller than $n$. So Theorem 6 can be viewed as
saying that McDiarmid's inequality can be significantly strengthened when the given function is known to
be submodular.
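For one family of functions, both sides of (3.9) can be computed exactly rather than estimated. The sketch below uses the rank function of a partition matroid with $m$ equal-sized blocks and $b_i = 1$, so $f(X)$ counts the blocks that $X$ intersects; under a product distribution with a common inclusion probability $p$ (a simplifying assumption made here), $f(X)$ is binomially distributed. All names are illustrative.

```python
import math

def binomial_pmf(m, q):
    """Probability mass function of Binomial(m, q) as a list indexed by value."""
    return [math.comb(m, j) * q**j * (1 - q)**(m - j) for j in range(m + 1)]

def blocks_hit_distribution(m, block_size, p):
    """Distribution of f(X) = number of blocks that X intersects.

    Each element joins X independently with probability p, so each block is
    hit independently with probability q = 1 - (1 - p)**block_size.
    """
    q = 1 - (1 - p) ** block_size
    return binomial_pmf(m, q)

def theorem6_sides(pmf, b, t):
    """Return (Pr[f <= b - t*sqrt(b)] * Pr[f >= b], exp(-t^2 / 4))."""
    threshold = b - t * math.sqrt(b)
    lower = sum(pr for value, pr in enumerate(pmf) if value <= threshold)
    upper = sum(pr for value, pr in enumerate(pmf) if value >= b)
    return lower * upper, math.exp(-t * t / 4)

pmf = blocks_hit_distribution(m=50, block_size=3, p=0.2)
for b, t in [(20, 1.0), (25, 2.0), (30, 3.0)]:
    lhs, rhs = theorem6_sides(pmf, b, t)
    print(b, t, lhs <= rhs)
```

For this instance the left-hand side is far below the bound, which is expected since the theorem covers every non-negative, monotone, submodular, 1-Lipschitz function.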

Our proof of Theorem 6 is based on the Talagrand inequality [80, 3, 69, 46]. Independently, Chekuri et
al. [15] proved a similar result using the FKG inequality. Concentration results of this flavor can also be
proven using the framework of self-bounding functions [11], as observed in an earlier paper by Hajiaghayi
et al. [39] (for a specific class of submodular functions); see also the survey by Vondrak [87].

Theorem 6 most naturally implies concentration around a median of $f(X)$. By standard manipulations,
e.g., [46, §2.5] or [69, §20.2], this also implies concentration around the expected value. We obtain:

Corollary 1. Let $f : 2^{[n]} \to \mathbb{R}_+$ be a non-negative, monotone, submodular, 1-Lipschitz function. Let the
random variable $X \subseteq [n]$ have a product distribution. For any $0 < \alpha < 1$, if $\mathbb{E}[f(X)] \ge 240/\alpha$, then

$$ \Pr\bigl[\, |f(X) - \mathbb{E}[f(X)]| > \alpha\, \mathbb{E}[f(X)] \,\bigr] \le 4 \exp\bigl( -\alpha^2\, \mathbb{E}[f(X)] / 16 \bigr). $$

As an interesting application of Corollary 1, let us consider the case where $f$ is the rank function of a
linear matroid. Formally, fix a matrix $A$ over any field. Construct a random submatrix by selecting the $i$-th
column of $A$ with probability $p_i$, where these selections are made independently. Then Corollary 1 implies
that the rank of the resulting submatrix is highly concentrated around its expectation, in a way that does not
depend on the number of rows of $A$.

The proofs of this section are technical applications of Talagrand's inequality and are provided in
Appendix B. Later sections of the paper use Theorem 6 and Corollary 1 to prove various results. In Section 4.2
we use these theorems to analyze our algorithm for PMAC-learning submodular functions under product
distributions. In Section 5 we use these theorems to give an approximate characterization of matroid rank
functions.

4 Learning Submodular Functions 

4.1 A New Learning Model: The PMAC Model 

In this section we introduce the PMAC model for learning real-valued functions, which models learning
real-valued functions in the passive, supervised learning paradigm. There is a space $X$ of points and a fixed
but unknown distribution $D$ on $X$. The points in $X$ are called "examples". There is a fixed but unknown
function $f^* : X \to \mathbb{R}_+$, which is called the "target function", assigning a value to each example in $X$. The
values assigned by $f^*$ are called "labels". In this model, a learning algorithm is provided a set $\mathcal{S}$ of examples,
called "training examples", drawn i.i.d. from $D$. The algorithm is also provided the labels assigned by $f^*$
to the training examples. The algorithm may perform an arbitrary polynomial time computation on the
labeled examples $\mathcal{S}$, then must output another function $f : X \to \mathbb{R}_+$. This function is called a "hypothesis
function". The goal is that, with high probability, $f$ is a good approximation of $f^*$ for most points in $D$.
Formally:

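Definition 3. Let $\mathcal{F}$ be a family of non-negative, real-valued functions with domain $X$. We say that an
algorithm $\mathcal{A}$ PMAC-learns $\mathcal{F}$ with approximation factor $\alpha$ if, for any distribution $D$ over $X$, for any target
function $f^* \in \mathcal{F}$, and for $\epsilon > 0$ and $\delta > 0$ sufficiently small:

• The input to $\mathcal{A}$ is a sequence of pairs $\{(x_i, f^*(x_i))\}_{1 \le i \le \ell}$ where each $x_i$ is i.i.d. from $D$.

• The number of inputs $\ell$ provided to $\mathcal{A}$ and the running time of $\mathcal{A}$ are both at most $\mathrm{poly}(n, 1/\epsilon, 1/\delta)$.

• The output of $\mathcal{A}$ is a function $f : X \to \mathbb{R}$ that can be evaluated in time $\mathrm{poly}(n, 1/\epsilon, 1/\delta)$ and that
satisfies

$$ \Pr_{x_1, \ldots, x_\ell \sim D} \Bigl[\ \Pr_{x \sim D} \bigl[\, f(x) \le f^*(x) \le \alpha \cdot f(x) \,\bigr] \ge 1 - \epsilon \ \Bigr] \ge 1 - \delta. $$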

The name PMAC stands for "Probably Mostly Approximately Correct". It is an extension of the PAC 
model to learning non-negative, real-valued functions, allowing multiplicative error a. The PAC model for 
learning boolean functions is precisely the special case when a = 1. 
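The inner probability in Definition 3 can be estimated empirically from fresh samples. The following sketch is illustrative only: the function name, the toy target $f^*(S) = |S|$ and the constant hypothesis are assumptions, not part of the model's definition.

```python
import random

def pmac_success_rate(hypothesis, target, alpha, sample_points):
    """Fraction of points x with hypothesis(x) <= target(x) <= alpha * hypothesis(x)."""
    good = sum(1 for x in sample_points
               if hypothesis(x) <= target(x) <= alpha * hypothesis(x))
    return good / len(sample_points)

# Toy target: the cardinality function on subsets of [n], which is submodular.
target = len
constant_one = lambda S: 1

points = [frozenset(), frozenset({0}), frozenset({0, 1}), frozenset({0, 1, 2})]
print(pmac_success_rate(constant_one, target, 2, points))
print(pmac_success_rate(constant_one, target, 3, points))

# Estimate under a product distribution: each element joins independently w.p. 1/2.
rng = random.Random(0)
n = 20
fresh = [frozenset(i for i in range(n) if rng.random() < 0.5) for _ in range(1000)]
half_n = lambda S: n / 4   # a constant guess equal to half the expected value n/2
print(pmac_success_rate(half_n, target, 4, fresh))
```

A hypothesis meets the PMAC requirement for a given $\alpha$ and $\epsilon$ when this rate, taken over the true distribution, is at least $1 - \epsilon$.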

In this paper we focus on the PMAC-learnability of submodular functions. In this case $X = \{0, 1\}^n$.
We note that it is quite easy to PAC-learn the class of boolean submodular functions. Details are given in
Appendix C.1. The rest of this section considers the much more challenging task of PMAC-learning the
general class of real-valued, submodular functions.

4.2 Product Distributions 

A first natural and common step in studying learning problems is to study learnability of functions when 
the examples are distributed according to the uniform distribution or a product distribution [49, 56, 64]. In
this section we consider learnability of submodular functions when the underlying distribution is a product 
distribution and provide an algorithm that PMAC learns Lipschitz submodular functions with a constant 
approximation factor. 

We begin with the following technical lemma which states some useful concentration bounds. 

Lemma 2. Let $f^* : 2^{[n]} \to \mathbb{R}$ be a non-negative, monotone, submodular, 1-Lipschitz function. Suppose
that $S_1, \ldots, S_\ell$ are drawn from a product distribution $D$ over $2^{[n]}$. Let $\mu$ be the empirical average
$\mu = \sum_{i=1}^{\ell} f^*(S_i)/\ell$, which is our estimate for $\mathbb{E}_{S \sim D}[f^*(S)]$. Let $\epsilon, \delta \le 1/5$. We have:

(1) If $\mathbb{E}[f^*(S)] \ge 500 \log(1/\epsilon)$ and $\ell \ge 12 \log(1/\delta)$ then

$$ \Pr[\, \mu \ge 450 \log(1/\epsilon) \,] \ge 1 - \delta/4. $$

(2) If $\mathbb{E}[f^*(S)] \ge 400 \log(1/\epsilon)$ and $\ell \ge 12 \log(1/\delta)$ then

Pr[|E[/*(5)]< M <|E[r(S)]] > 1-5/4. 

(3) TjfE [/*(£)] < 5001og(l/e) and £ > 12 log(l/£) then 

Pr [/*(£) < 12001og(l/e)] > 1 - e. 

(4) //E[/*(5)] < 4001og(l/e) and £ > 121og(l/fl) then 

Pr [p < 4501og(l/e)] > 1-5/4. 



The proof of Lemma 2 follows easily from Theorem 6 and Corollary 1, and it is provided in Appendix C.2. We now present our main result in this section.

Algorithm 1 An algorithm for PMAC-learning a non-negative, monotone, 1-Lipschitz, submodular function $f^*$ when the examples come from a product distribution. Its input is a sequence of labeled training examples $(S_1, f^*(S_1)), \ldots, (S_\ell, f^*(S_\ell))$, and parameters $\epsilon$ and $\ell$.

• Let $\mu = \sum_{i=1}^{\ell} f^*(S_i)/\ell$.

• Case 1: If $\mu \ge 450 \log(1/\epsilon)$, then return the constant function $f = \mu/4$.

• Case 2: If $\mu < 450 \log(1/\epsilon)$, then compute the set $U = \bigcup_{i : f^*(S_i) = 0} S_i$. Return the function $f$ where $f(A) = 0$ if $A \subseteq U$ and $f(A) = 1$ otherwise.

Theorem 7. Let $\mathcal{F}$ be the class of non-negative, monotone, 1-Lipschitz, submodular functions with ground set $[n]$ and minimum non-zero value 1. Let $D$ be a product distribution on $\{0, 1\}^n$. For any sufficiently small $\epsilon > 0$ and $\delta > 0$, Algorithm 1 PMAC-learns $\mathcal{F}$ with approximation factor $\alpha = O(\log(1/\epsilon))$. The number of training examples used is $\ell = n \log(n/\delta)/\epsilon + 12 \log(1/\delta)$.

If it is known a priori that $E[f^*(S)] > 500 \log(1/\epsilon)$ then the approximation factor improves to 8, and the number of samples can be reduced to $\ell = 12 \log(1/\delta)$, which is independent of $n$ and $\epsilon$.
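The two cases of Algorithm 1 translate almost directly into code. The sketch below is illustrative only: the function name `learn_product_pmac`, the representation of sets as frozensets, and the use of the natural logarithm for $\log$ are assumptions made here, not choices fixed by the text.

```python
import math

def learn_product_pmac(examples, eps):
    """Sketch of Algorithm 1.

    examples: list of (S, value) pairs, with S a frozenset and value = f*(S).
    Returns a hypothesis f mapping a set A to a number.
    """
    ell = len(examples)
    mu = sum(value for _, value in examples) / ell      # empirical average
    threshold = 450 * math.log(1 / eps)                 # natural log assumed

    if mu >= threshold:
        # Case 1: the target concentrates, so a constant hypothesis suffices.
        const = mu / 4
        return lambda A: const

    # Case 2: U is the union of all sampled zeros of f*.
    U = frozenset().union(*(S for S, value in examples if value == 0))
    return lambda A: 0 if frozenset(A) <= U else 1

# Toy usage with hand-made labels (hypothetical data).
toy = [(frozenset({3, 4}), 0), (frozenset({4, 5}), 0), (frozenset({0, 3}), 1)]
f = learn_product_pmac(toy, eps=0.1)
print(f({3, 5}), f({0}))
```

In Case 2 the hypothesis is the indicator of "not contained in the null subcube", which is why only the zero-labeled examples are used.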
Proof. We begin with an overview of the proof. Consider the expected value of $f^*(S)$ when $S$ is drawn from distribution $D$. When this expected value of $f^*$ is large compared to $\log(1/\epsilon)$, we simply output a constant function given by the empirical average $\mu$ estimated by the algorithm. Our concentration bound for submodular functions (Corollary 1) allows us to show that this constant function provides a good estimate. However, when the expected value of $f^*$ is small, we must carefully handle the zeros of $f^*$, since they may have large measure under distribution $D$. The key idea here is to use the fact that the zeros of a non-negative, monotone, submodular function have special structure: they are both union-closed and downward-closed, so it is sufficient to PAC-learn the Boolean NOR function which indicates the zeros of $f^*$.

We now present the proof formally. Let us first consider the empirical average $\mu = \sum_{i=1}^{\ell} f^*(S_i)/\ell$, which is our estimate for $E_{S \sim D}[f^*(S)]$. We can analyze the accuracy of this estimate using Theorem 6 and Corollary 1 because $f^*$ is monotone, submodular and 1-Lipschitz, and these properties are preserved when summing copies of $f^*$ over $\ell$ disjoint copies of the ground set.

By Lemma 2, with probability at least $1 - \delta$, we may assume that the following implications hold.

$$\mu \ge 450 \log(1/\epsilon) \implies E[f^*(S)] \ge 400 \log(1/\epsilon) \ \text{ and } \ \tfrac{5}{6} E[f^*(S)] \le \mu \le \tfrac{4}{3} E[f^*(S)]$$

$$\mu < 450 \log(1/\epsilon) \implies E[f^*(S)] \le 500 \log(1/\epsilon).$$

Now we show that the function $f$ output by the algorithm approximates $f^*$ to within a factor $O(\log(1/\epsilon))$.

Case 1: $\mu \ge 450 \log(1/\epsilon)$. By our assumed implication we have $\tfrac{5}{6} E[f^*(S)] \le \mu \le \tfrac{4}{3} E[f^*(S)]$ and $E[f^*(S)] \ge 400 \log(1/\epsilon)$. Using these together with Corollary 1 we obtain:

$$\Pr[\, \mu/4 \le f^*(S) \le 2\mu \,] \ \ge\ \Pr\big[\, \tfrac{1}{3} E[f^*(S)] \le f^*(S) \le \tfrac{5}{3} E[f^*(S)] \,\big]$$

$$\ge\ 1 - \Pr\big[\, |f^*(S) - E[f^*(S)]| > \tfrac{2}{3} E[f^*(S)] \,\big] \qquad (4.1)$$

$$\ge\ 1 - 4 e^{-E[f^*(S)]/100} \ \ge\ 1 - \epsilon,$$

assuming $\epsilon$ is sufficiently small. Therefore, with confidence at least $1 - \delta$, the constant function $f$ output by the algorithm approximates $f^*$ to within a factor 8 on all but an $\epsilon$ fraction of the distribution.
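The last step of (4.1) can be spelled out as a short worked check. It uses only the Case 1 assumption $E[f^*(S)] \ge 400 \log(1/\epsilon)$, and it assumes that $\log$ denotes the natural logarithm, which is not stated explicitly here.

```latex
\begin{align*}
4 e^{-E[f^*(S)]/100}
  \;\le\; 4 e^{-4 \log(1/\epsilon)}
  \;=\; 4 \epsilon^{4}
  \;\le\; \epsilon
  \qquad \text{whenever } \epsilon \le 4^{-1/3} \approx 0.63 .
\end{align*}
```

So under that reading of $\log$, "sufficiently small" in this step requires nothing more than $\epsilon \le 4^{-1/3}$.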

Case 2: $\mu < 450 \log(1/\epsilon)$. As mentioned above, we must separately handle the zeros and the non-zeros of $f^*$. To that end, define

$$\mathcal{P} = \{ S : f^*(S) > 0 \} \quad \text{and} \quad \mathcal{Z} = \{ S : f^*(S) = 0 \}.$$

Recall that the algorithm sets $U = \bigcup_{i : f^*(S_i) = 0} S_i$. Monotonicity and submodularity imply that $f^*(U) = 0$. Furthermore, setting $\mathcal{C} = \{ T : T \subseteq U \}$, monotonicity implies that

$$f^*(T) = 0 \quad \forall\, T \in \mathcal{C}. \qquad (4.2)$$

We wish to analyze the measure of the points for which the function $f$ output by the algorithm fails to provide a good estimate of $f^*$. So let $S$ be a new sample from $D$ and let $\mathcal{E}$ be the event that $S$ violates the inequality

$$f(S) \le f^*(S) \le (1200 \log(1/\epsilon)) \cdot f(S).$$

Our goal is to show that, with probability $1 - \delta$ over the training examples, we have $\Pr[\mathcal{E}] \le \epsilon$. Clearly

$$\Pr[\mathcal{E}] = \Pr[\, \mathcal{E} \wedge S \in \mathcal{P} \,] + \Pr[\, \mathcal{E} \wedge S \in \mathcal{Z} \,].$$

We will separately analyze these two probabilities.

First we analyze the non-zeros of $f^*$. So assume that $S \in \mathcal{P}$, which implies that $f^*(S) \ge 1$ by our hypothesis. Then $S \not\subseteq U$ (by Eq. (4.2)), and hence $f(S) = 1$ by the definition of $f$. Therefore the event $\mathcal{E} \wedge S \in \mathcal{P}$ can only occur when $f^*(S) > 1200 \log(1/\epsilon)$. By our assumed implication we have $E[f^*(S)] \le 500 \log(1/\epsilon)$, so we can apply Lemma 2. This shows that

$$\Pr[\, \mathcal{E} \wedge S \in \mathcal{P} \,] \le \Pr[\, f^*(S) > 1200 \log(1/\epsilon) \,] \le \epsilon.$$

It remains to analyze the zeros of $f^*$. Assume that $S \in \mathcal{Z}$, i.e., $f^*(S) = 0$. Since our hypothesis has $f(S) = 0$ for all $S \in \mathcal{C}$, the event $\mathcal{E} \wedge S \in \mathcal{Z}$ holds only if $S \in \mathcal{Z} \setminus \mathcal{C}$. The proof now follows from Claim 4. ■

Claim 4. With probability at least $1 - \delta$, the set $\mathcal{Z} \setminus \mathcal{C}$ has measure at most $\epsilon$.

Proof. The idea of the proof is as follows. At any stage of the algorithm, we can compute the set $U$ and the subcube $\mathcal{C} = \{ T : T \subseteq U \}$. We refer to $\mathcal{C}$ as the algorithm's null subcube. Suppose that there is at least an $\epsilon$ chance that a new example is a zero of $f^*$, but does not lie in the null subcube. Then such an example should be seen in the next sequence of $\log(1/\delta)/\epsilon$ examples, with probability at least $1 - \delta$. This new example increases the dimension of the null subcube by at least one, and therefore this can happen at most $n$ times.

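Formally, for $k \le \ell$, define

$$U_k = \bigcup_{i \le k,\ f^*(S_i) = 0} S_i \quad \text{and} \quad \mathcal{C}_k = \{ S : S \subseteq U_k \}.$$

As argued above, we have $\mathcal{C}_k \subseteq \mathcal{Z}$ for any $k$. Suppose that, for some $k$, the set $\mathcal{Z} \setminus \mathcal{C}_k$ has measure at least $\epsilon$. Define $k' = k + \log(n/\delta)/\epsilon$. Then amongst the subsequent examples $S_{k+1}, \ldots, S_{k'}$, the probability that none of them lie in $\mathcal{Z} \setminus \mathcal{C}_k$ is at most

$$(1 - \epsilon)^{\log(n/\delta)/\epsilon} \le \delta/n.$$

On the other hand, if one of them does lie in $\mathcal{Z} \setminus \mathcal{C}_k$, then $|U_{k'}| > |U_k|$. But $|U_k| \le n$ for all $k$, so this can happen at most $n$ times. Since $\ell \ge n \log(n/\delta)/\epsilon$, a union bound over these at most $n$ stages shows that with probability at least $1 - \delta$ the final set $\mathcal{Z} \setminus \mathcal{C}_\ell$ has measure at most $\epsilon$. ■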

The class $\mathcal{F}$ defined in Theorem 7 contains the class of matroid rank functions. We remark that Theorem 7 can be easily modified to handle the case where the minimum non-zero value for functions in $\mathcal{F}$ is $\eta < 1$. To do this, we simply modify Case 2 of the algorithm to output $f(A) = \eta$ for all $A \not\subseteq U$. The same proof shows that this modified algorithm has an approximation factor of $O(\log(1/\epsilon)/\eta)$.

4.3 Inapproximability under Arbitrary Distributions 

The simplicity of Algorithm 1 might raise one's hopes that a constant-factor approximation is possible under arbitrary distributions. However, we show in this section that no such approximation is possible. In particular, we show that no algorithm can PMAC-learn the class of non-negative, monotone, submodular functions with approximation factor $o(n^{1/3}/\log n)$.

Theorem 8. Let ALG be an arbitrary learning algorithm that uses only a polynomial number of training examples drawn i.i.d. from the underlying distribution. There exists a distribution $D$ and a submodular target function $f^*$ such that, with probability at least 1/8 (over the draw of the training samples), the hypothesis function $f$ output by ALG does not approximate $f^*$ within a $o(n^{1/3}/\log n)$ factor on at least a 1/4 fraction of the examples under $D$. This holds even for the subclass of matroid rank functions.

Proof (of Theorem 8). To show the lower bound, we use the family of matroids from Theorem 1 in Section 3.1.4, whose rank functions take wildly varying values on a large set of points. The high level idea is to show that for a super-polynomial sized set of $k$ points in $\{0, 1\}^n$, and for any partition of those points into HIGH and LOW, we can construct a matroid where the points in HIGH have rank $r_{\mathrm{high}}$ and the points in LOW have rank $r_{\mathrm{low}}$, and the ratio $r_{\mathrm{high}}/r_{\mathrm{low}} = \tilde{\Omega}(n^{1/3})$. This then implies hardness for learning over the uniform distribution on these $k$ points from any polynomial-sized sample, even with value queries.

To make the proof formal, we use the probabilistic method. Assume that ALG uses $\ell \le n^c$ training examples for some constant $c$. To construct a hard family of submodular functions, we will apply Theorem 1 with $k = 2^t$ where $t = c \log(n) + 3$. Let $\mathcal{A}$ and $\mathcal{M}$ be the families that are guaranteed to exist by Theorem 1. Let the underlying distribution $D$ on $2^{[n]}$ be the uniform distribution on $\mathcal{A}$. (We note that $D$ is not a product distribution.) Choose a matroid $M_{\mathcal{B}} \in \mathcal{M}$ uniformly at random and let the target function be $f^* = \mathrm{rank}_{M_{\mathcal{B}}}$. Clearly ALG does not know $\mathcal{B}$.

Assume that ALG uses a set $\mathcal{S}$ of $\ell$ training examples. For any $A \in \mathcal{A}$ that is not a training example, the algorithm ALG has no information about $f^*(A)$; in particular, the conditional distribution of its value, given $\mathcal{S}$, remains uniform in $\{8t, |A|\}$. So ALG cannot determine its value better than randomly guessing between the two possible values $8t$ and $|A|$. The set of non-training examples has measure $1 - 2^{-t + \log \ell}$. Thus

$$E_{f^*, \mathcal{S}}\left[ \Pr_{A \sim D}\left[ f^*(A) \notin \Big[ f(A),\ \frac{n^{1/3}}{16t} f(A) \Big] \right] \right] \ \ge\ \frac{1 - 2^{-t + \log \ell}}{2} \ \ge\ 7/16.$$

Therefore, there exists $f^*$ such that

$$\Pr_{\mathcal{S}}\left[ \Pr_{A \sim D}\left[ f^*(A) \notin \Big[ f(A),\ \frac{n^{1/3}}{16t} f(A) \Big] \right] \ge 1/4 \right] \ \ge\ 1/8.$$

That is, there exists $f^*$ such that with probability at least 1/8 (over the draw of the training samples) the hypothesis function $f$ output by ALG does not approximate $f^*$ within a $o(n^{1/3}/\log n)$ factor on at least a 1/4 fraction of the examples under $D$. ■
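The passage from the expectation bound of 7/16 to the "probability at least 1/8" statement is an averaging argument. The worked step below is a sketch in notation introduced here for convenience: $X \in [0, 1]$ denotes the fraction of examples (under $D$) on which the hypothesis fails, viewed as a random variable over the training sample, for a target $f^*$ whose expected failure fraction is at least 7/16 (such a target exists by averaging over the random choice of $f^*$).

```latex
\begin{align*}
\tfrac{7}{16} \;\le\; E[X]
  \;\le\; 1 \cdot \Pr[X \ge \tfrac14] \;+\; \tfrac14 \cdot \Pr[X < \tfrac14]
  \;\le\; \Pr[X \ge \tfrac14] + \tfrac14
\quad\Longrightarrow\quad
\Pr[X \ge \tfrac14] \;\ge\; \tfrac{3}{16} \;\ge\; \tfrac18 .
\end{align*}
```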

We can further show that the lower bound in Theorem 8 holds even if the algorithm is told the underlying distribution, even if the algorithm can query the function on inputs of its choice, and even if the queries are adaptive. In other words, this inapproximability still holds in the PMAC model augmented with value queries. Specifically:

Theorem 9. Let ALG be an arbitrary learning algorithm that uses only a polynomial number of training examples, which can be either drawn i.i.d. from the underlying distribution or value queries. There exists a distribution $D$ and a submodular target function $f^*$ such that, with probability at least 1/4 (over the draw of the training samples), the hypothesis function output by ALG does not approximate $f^*$ within a $o(n^{1/3}/\log n)$ factor on at least a 1/4 fraction of the examples under $D$. This holds even for the subclass of matroid rank functions.

Theorem 8 is an information-theoretic hardness result. A slight modification yields Corollary 2, which is a complexity-theoretic hardness result.

Corollary 2. Suppose one-way functions exist. For any constant $\epsilon > 0$, no algorithm can PMAC-learn the class of non-negative, monotone, submodular functions with approximation factor $O(n^{1/3 - \epsilon})$, even if the functions are given by polynomial-time algorithms computing their value on the support of the distribution.

Algorithm 2 Algorithm for PMAC-learning the class of non-negative monotone submodular functions.
Input: A sequence of labeled training examples $\mathcal{S} = \{(S_1, f^*(S_1)), (S_2, f^*(S_2)), \ldots, (S_\ell, f^*(S_\ell))\}$, where $f^*$ is a submodular function.

• Let $\mathcal{S}_{\ne 0} = \{(A_1, f^*(A_1)), \ldots, (A_a, f^*(A_a))\}$ be the subsequence of $\mathcal{S}$ with $f^*(A_i) \ne 0$ for all $i$. Let $\mathcal{S}_0 = \mathcal{S} \setminus \mathcal{S}_{\ne 0}$. Let $\mathcal{U}_0$ be the set of indices defined as $\mathcal{U}_0 = \bigcup_{i \le \ell,\ f^*(S_i) = 0} S_i$.

• For each $1 \le i \le a$, let $y_i$ be the outcome of flipping a fair $\{+1, -1\}$-valued coin, each coin flip independent of the others. Let $x_i \in \mathbb{R}^{n+1}$ be the point defined by

$$x_i = \begin{cases} (\chi(A_i),\ f^*(A_i)^2) & \text{(if } y_i = +1\text{)} \\ (\chi(A_i),\ (n+1) \cdot f^*(A_i)^2) & \text{(if } y_i = -1\text{)}. \end{cases}$$

• Find a linear separator $u = (w, -z) \in \mathbb{R}^{n+1}$, where $w \in \mathbb{R}^n$ and $z \in \mathbb{R}$, such that $u$ is consistent with the labeled examples $(x_i, y_i)$ for all $i \in [a]$, and with the additional constraint that $w_j = 0$ for all $j \in \mathcal{U}_0$.

Output: The function $f$ defined as $f(S) = \left( \frac{1}{(n+1) z}\, w^{\mathsf{T}} \chi(S) \right)^{1/2}$.

The proofs of Theorem 9 and Corollary 2 are given in Appendix C.3. The lower bound in Corollary 2 gives a family of submodular functions that are hard to learn, even though the functions can be evaluated by polynomial-time algorithms on the support of the distribution. However, we do not prove that the functions can be evaluated by polynomial-time algorithms at arbitrary points, and we leave it as an open question whether such a construction is possible.

4.4 An $O(\sqrt{n})$-approximation Algorithm

In this section we discuss our most general upper bound for efficiently PMAC-learning the class of non-negative, monotone, submodular functions with an approximation factor of $O(\sqrt{n})$. We start with a useful structural lemma concerning submodular functions.

Lemma 3 (Goemans et al. [32]). Let $f : 2^{[n]} \to \mathbb{R}_+$ be a normalized, non-negative, monotone, submodular function. Then there exists a function $\hat{f}$ of the form $\hat{f}(S) = \sqrt{w^{\mathsf{T}} \chi(S)}$ where $w \in \mathbb{R}^n_+$ such that $\hat{f}(S) \le f(S) \le \sqrt{n}\, \hat{f}(S)$ for all $S \subseteq [n]$.

We now use the preceding lemma in proving our main algorithmic result. 

Theorem 10. Let $\mathcal{F}$ be the class of non-negative, monotone, submodular functions over $X = 2^{[n]}$. There is an algorithm that PMAC-learns $\mathcal{F}$ with approximation factor $\sqrt{n+1}$. That is, for any distribution $D$ over $X$, for any $\epsilon, \delta$ sufficiently small, with probability $1 - \delta$, the algorithm produces a function $f$ that approximates $f^*$ within a multiplicative factor of $\sqrt{n+1}$ on a set of measure $1 - \epsilon$ with respect to $D$. The algorithm uses $\ell = \frac{48 n}{\epsilon} \log\left(\frac{9n}{\delta\epsilon}\right)$ training examples and runs in time $\mathrm{poly}(n, 1/\epsilon, 1/\delta)$.

Proof. As in Theorem 7, because of the multiplicative error allowed by the PMAC-learning model, we will separately analyze the subset of the instance space where $f^*$ is zero and the subset of the instance space where $f^*$ is non-zero. For convenience, let us define:

$$\mathcal{P} = \{ S : f^*(S) \ne 0 \} \quad \text{and} \quad \mathcal{Z} = \{ S : f^*(S) = 0 \}.$$

The main idea of our algorithm is to reduce our learning problem to the standard problem of learning a binary classifier (in fact, a linear separator) from i.i.d. samples in the passive, supervised learning setting [54, 83], with a slight twist in order to handle the points in $\mathcal{Z}$. The problem of learning a linear separator in the passive supervised learning setting is one where the instance space is $\mathbb{R}^m$, the samples are independently drawn from some fixed and unknown distribution $D'$ on $\mathbb{R}^m$, and there is a fixed but unknown target function $c^* : \mathbb{R}^m \to \{-1, +1\}$ defined by $c^*(x) = \mathrm{sgn}(u^{\mathsf{T}} x)$ for some vector $u \in \mathbb{R}^m$. The examples induced by $D'$ and $c^*$ are called linearly separable.

The linear separator learning problem we reduce to is defined as follows. The instance space is $\mathbb{R}^m$ where $m = n + 1$ and the distribution $D'$ is defined by the following procedure for generating a sample from it. Repeatedly draw a sample $S \subseteq [n]$ from the distribution $D$ until $f^*(S) \ne 0$. Next, flip a fair coin. The sample from $D'$ is

$$\begin{cases} (\chi(S),\ f^*(S)^2) & \text{(if the coin is heads)} \\ (\chi(S),\ (n+1) \cdot f^*(S)^2) & \text{(if the coin is tails).} \end{cases}$$

The function $c^*$ defining the labels is as follows: samples for which the coin was heads are labeled $+1$, and the others are labeled $-1$.

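We claim that the distribution over labeled examples induced by $D'$ and $c^*$ is linearly separable in $\mathbb{R}^{n+1}$. To prove this we use Lemma 3, which says that there exists a linear function $\hat{f}(S) = w^{\mathsf{T}} \chi(S)$ such that

$$\hat{f}(S) \le f^*(S)^2 \le n \cdot \hat{f}(S) \quad \text{for all } S \subseteq [n].$$

Let $u = ((n + 1/2) \cdot w, -1) \in \mathbb{R}^m$. For any point $x$ in the support of $D'$ we have

$$x = (\chi(S),\ f^*(S)^2) \implies u^{\mathsf{T}} x = (n + 1/2) \cdot \hat{f}(S) - f^*(S)^2 > 0$$

$$x = (\chi(S),\ (n+1) \cdot f^*(S)^2) \implies u^{\mathsf{T}} x = (n + 1/2) \cdot \hat{f}(S) - (n+1) \cdot f^*(S)^2 < 0.$$

This proves the claim. Moreover, this linear function also satisfies $\hat{f}(S) = 0$ for every $S \in \mathcal{Z}$. In particular, $\hat{f}(S) = 0$ for all $S \in \mathcal{S}_0$ and moreover,

$$\hat{f}(\{j\}) = w_j = 0 \ \text{ for every } j \in \mathcal{U}_D, \ \text{ where } \ \mathcal{U}_D = \bigcup_{S_i \in \mathcal{Z}} S_i.$$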

Our algorithm is now as follows. It first partitions the training set $\mathcal{S} = \{(S_1, f^*(S_1)), \ldots, (S_\ell, f^*(S_\ell))\}$ into two sets $\mathcal{S}_0$ and $\mathcal{S}_{\ne 0}$, where $\mathcal{S}_0$ is the subsequence of $\mathcal{S}$ with $f^*(S_i) = 0$, and $\mathcal{S}_{\ne 0} = \mathcal{S} \setminus \mathcal{S}_0$. For convenience, let us denote the sequence $\mathcal{S}_{\ne 0}$ as

$$\mathcal{S}_{\ne 0} = \big( (A_1, f^*(A_1)), \ldots, (A_a, f^*(A_a)) \big).$$

Note that $a$ is a random variable and we can think of the sets $A_i$ as drawn independently from $D$, conditioned on belonging to $\mathcal{P}$. Let

$$\mathcal{U}_0 = \bigcup_{i \le \ell,\ f^*(S_i) = 0} S_i \quad \text{and} \quad \mathcal{C}_0 = \{ S : S \subseteq \mathcal{U}_0 \}.$$

Using $\mathcal{S}_{\ne 0}$, the algorithm then constructs a sequence $\mathcal{S}'_{\ne 0} = ((x_1, y_1), \ldots, (x_a, y_a))$ of training examples for the binary classification problem. For each $1 \le i \le a$, let $y_i$ be $-1$ or $+1$, each with probability 1/2. If $y_i = +1$ set $x_i = (\chi(A_i), f^*(A_i)^2)$; otherwise set $x_i = (\chi(A_i), (n+1) \cdot f^*(A_i)^2)$. The last step of our algorithm is to solve a linear program in order to find a linear separator $u = (w, -z)$ where $w \in \mathbb{R}^n$, $z \in \mathbb{R}$, and

• $u$ is consistent with the labeled examples $(x_i, y_i)$ for all $i = 1, \ldots, a$, and

• $w_j = 0$ for all $j \in \mathcal{U}_0$.

The output hypothesis is $f(S) = \left( \frac{1}{(n+1) z}\, w^{\mathsf{T}} \chi(S) \right)^{1/2}$.
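The reduction can be exercised end to end on a toy target. The sketch below is illustrative and rests on several assumptions made here: the target is $f^*(S) = \sqrt{|S|}$ (so that $f^*(S)^2 = |S|$ is exactly linear with $w = \mathbf{1}$), the coin flips are replaced by a deterministic alternation, and the separator is the one from the separability argument, $u = ((n + 1/2) w, -1)$, rather than the output of a linear program. All helper names are hypothetical.

```python
import math
from itertools import combinations

def chi(S, n):
    """Indicator vector of S as a list of 0/1 entries."""
    return [1 if j in S else 0 for j in range(n)]

def make_examples(nonzero_sets, f_star, n, coin):
    """Build the labeled points (x_i, y_i); coin(i) returns +1 or -1."""
    out = []
    for i, A in enumerate(nonzero_sets):
        y = coin(i)
        last = f_star(A) ** 2 if y == +1 else (n + 1) * f_star(A) ** 2
        out.append((chi(A, n) + [last], y))
    return out

def consistent(u, examples):
    """True if sgn(u . x) agrees with y on every labeled example."""
    return all(y * sum(a * b for a, b in zip(u, x)) > 0 for x, y in examples)

def hypothesis(u, n):
    """f(S) = sqrt(w . chi(S) / ((n + 1) z)) for a separator u = (w, -z)."""
    w, z = u[:n], -u[n]
    return lambda S: math.sqrt(sum(w[j] for j in S) / ((n + 1) * z))

n = 4
f_star = lambda S: math.sqrt(len(S))     # toy target with linear square
sets = [frozenset(c) for r in range(1, n + 1)
        for c in combinations(range(n), r)]
examples = make_examples(sets, f_star, n,
                         coin=lambda i: +1 if i % 2 == 0 else -1)
u = [n + 0.5] * n + [-1.0]               # ((n + 1/2) * w, -1) with w = all ones
f = hypothesis(u, n)
print(consistent(u, examples))
```

On every non-empty set the resulting hypothesis satisfies $f(S) \le f^*(S) \le \sqrt{n+1}\, f(S)$, which matches the guarantee claimed for correctly labeled points. In the actual algorithm the separator would come from a linear program with the extra constraints $w_j = 0$ for $j \in \mathcal{U}_0$.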

To prove correctness, note first that the linear program is feasible; this follows from our earlier discussion using the facts (1) $\mathcal{S}'_{\ne 0}$ is a set of labeled examples drawn from $D'$ and labeled by $c^*$ and (2) $\mathcal{U}_0 \subseteq \mathcal{U}_D$. It remains to show that $f$ approximates the target on most of the points. Let $\mathcal{Y}$ denote the set of points $S \in \mathcal{P}$ such that both of the points $(\chi(S), f^*(S)^2)$ and $(\chi(S), (n+1) \cdot f^*(S)^2)$ are correctly labeled by $\mathrm{sgn}(u^{\mathsf{T}} x)$, the linear separator found by our algorithm. It is easy to show that the function $f(S) = \left( \frac{1}{(n+1) z}\, w^{\mathsf{T}} \chi(S) \right)^{1/2}$ approximates $f^*$ to within a factor $\sqrt{n+1}$ on all the points in the set $\mathcal{Y}$. To see this notice that for any point $S \in \mathcal{Y}$, we have

$$w^{\mathsf{T}} \chi(S) - z f^*(S)^2 > 0 \quad \text{and} \quad w^{\mathsf{T}} \chi(S) - z (n+1) f^*(S)^2 < 0.$$

So, for any point $S \in \mathcal{Y}$, we have $f(S)^2 < f^*(S)^2 < (n+1) f(S)^2$, and hence $f$ approximates $f^*$ to within a factor $\sqrt{n+1}$.

Moreover, by design the function $f$ correctly labels as 0 all the examples in $\mathcal{C}_0$. To finish the proof, we now note two important facts: for our choice of $\ell = \frac{16 n}{\epsilon} \log\left(\frac{n}{\delta\epsilon}\right)$, with high probability both $\mathcal{P} \setminus \mathcal{Y}$ and $\mathcal{Z} \setminus \mathcal{C}_0$ have small measure. The fact that $\mathcal{Z} \setminus \mathcal{C}_0$ has small measure follows from an argument similar to the one in Claim 4. We now prove:

Claim 5. If $\ell = \frac{16 n}{\epsilon} \log\left(\frac{n}{\delta\epsilon}\right)$, then with probability at least $1 - 2\delta$, the set $\mathcal{P} \setminus \mathcal{Y}$ has measure at most $2\epsilon$ under $D$.

Proof. Let $q = 1 - p = \Pr_{S \sim D}[S \in \mathcal{P}]$. If $q < \epsilon$ then the claim is immediate, since $\mathcal{P}$ has measure at most $\epsilon$. So assume that $q \ge \epsilon$. Let $\mu = E[a] = q\ell$. By assumption $\mu \ge 16 n \log(n/\delta\epsilon) \cdot \frac{q}{\epsilon}$. Then Chernoff bounds give that

$$\Pr\left[ a < 8 n \log(n/\delta\epsilon) \cdot \frac{q}{\epsilon} \right] < \exp(-n \log(n/\delta)\, q/\epsilon) < \delta.$$

So with probability at least $1 - \delta$, we have $a \ge 8 n \log(n/\delta\epsilon) \cdot \frac{q}{\epsilon}$. By a standard sample complexity argument [83] (which we reproduce in Theorem 25 in Appendix A.2), with probability at least $1 - \delta$, any linear separator consistent with $\mathcal{S}'_{\ne 0}$ will be inconsistent with the labels on a set of measure at most $\epsilon/q$ under $D'$. In particular, this property holds for the linear separator computed by the linear program. So for any set $S$, the conditional probability that either $(\chi(S), f^*(S)^2)$ or $(\chi(S), (n+1) \cdot f^*(S)^2)$ is incorrectly labeled, given that $S \in \mathcal{P}$, is at most $2\epsilon/q$. Thus

$$\Pr[\, S \in \mathcal{P} \wedge S \notin \mathcal{Y} \,] = \Pr[S \in \mathcal{P}] \cdot \Pr[\, S \notin \mathcal{Y} \mid S \in \mathcal{P} \,] \le q \cdot (2\epsilon/q),$$

as required. □

In summary, our algorithm produces a hypothesis $f$ that approximates $f^*$ to within a factor $\sqrt{n+1}$ on the set $\mathcal{Y} \cup \mathcal{C}_0$. The complement of this set is $(\mathcal{Z} \setminus \mathcal{C}_0) \cup (\mathcal{P} \setminus \mathcal{Y})$, which has measure at most $3\epsilon$, with probability at least $1 - 3\delta$. ■



Remark. Our algorithm proving Theorem 10 is significantly simpler than the algorithm of Goemans et al. [32], which achieves a slightly worse approximation factor in the model of approximately learning everywhere with value queries.

4.4.1 Extensions 

Our algorithm for learning submodular functions is quite robust and can be extended to handle more general scenarios, including forms of noise. In this section we discuss several such extensions.

It is clear from the proof of Theorem 10 that any improvements in the approximation factor for approximating submodular functions by linear functions (i.e., Lemma 3) for specific subclasses of submodular functions yield PMAC-learning algorithms with improved approximation factors.

Moreover, the algorithm described for learning submodular functions in the PMAC model is quite robust and it can be extended to handle more general cases as well as various forms of noise. For example, we can extend the result in Theorem 10 to the more general case where we do not even assume that the target function is submodular, but that it is within a factor $\alpha$ of a submodular function on every point in the instance space. Under this relaxed assumption we are able to achieve the approximation factor $\alpha\sqrt{n+1}$. Specifically:

Theorem 11. Let $\mathcal{F}$ be the class of non-negative, monotone, submodular functions over $X = 2^{[n]}$ and let

$$\mathcal{F}' = \{\, f : \exists g \in \mathcal{F},\ g(S) \le f(S) \le \alpha \cdot g(S) \text{ for all } S \subseteq [n] \,\},$$

for some known $\alpha > 1$. There is an algorithm that PMAC-learns $\mathcal{F}'$ with approximation factor $\alpha\sqrt{n+1}$. The algorithm uses $\ell = \frac{48 n}{\epsilon} \log\left(\frac{9n}{\delta\epsilon}\right)$ training examples and runs in time $\mathrm{poly}(n, 1/\epsilon, 1/\delta)$.

Proof. By assumption, there exists $g \in \mathcal{F}$ such that $g(S) \le f^*(S) \le \alpha \cdot g(S)$. Combining this with Lemma 3, we get that there exists $\hat{f}(S) = w^{\mathsf{T}} \chi(S)$ such that

$$w^{\mathsf{T}} \chi(S) \le f^*(S)^2 \le n \cdot \alpha^2 \cdot w^{\mathsf{T}} \chi(S) \quad \text{for all } S \subseteq [n].$$

We then apply the algorithm described in Theorem 10 with the following modifications: (1) in the second step if $y_i = +1$ we set $x_i = (\chi(S), f^*(S)^2)$ and if $y_i = -1$ we set $x_i = (\chi(S), \alpha^2 (n+1) \cdot f^*(S)^2)$; (2) we output the function $f(S) = \left( \frac{1}{\alpha^2 (n+1) z}\, w^{\mathsf{T}} \chi(S) \right)^{1/2}$. It is then easy to show that the distribution over labeled examples induced by $D'$ and $c^*$ is linearly separable in $\mathbb{R}^{n+1}$; in particular, $u = (\alpha^2 (n + 1/2) \cdot w, -1) \in \mathbb{R}^{n+1}$ defines a good linear separator. The proof then proceeds as in Theorem 10.



We can also extend the result in Theorem 10 to the agnostic case where we assume that there exists a submodular function that agrees with the target on all but an $\eta$ fraction of the points; note that on the $\eta$ fraction of the points the target can be arbitrarily far from a submodular function. In this case we can still PMAC-learn with a polynomial number of samples, $O\left(\frac{n}{\epsilon^2} \log\left(\frac{n}{\delta\epsilon}\right)\right)$, but using a potentially computationally inefficient procedure.

Theorem 12. Let $\mathcal{F}$ be the class of non-negative, monotone, submodular functions over $X = 2^{[n]}$. Let

$$\mathcal{F}' = \{\, f : \exists g \in \mathcal{F} \text{ s.t. } f(S) = g(S) \text{ on more than } 1 - \eta \text{ fraction of the points} \,\}.$$

There is an algorithm that PMAC-learns $\mathcal{F}'$ with approximation factor $\sqrt{n+1}$. That is, for any distribution $D$ over $X$, for any $\epsilon, \delta$ sufficiently small, with probability $1 - \delta$, the algorithm produces a function $f$ that approximates $f^*$ within a multiplicative factor of $\sqrt{n+1}$ on a set of measure $1 - \epsilon - \eta$ with respect to $D$. The algorithm uses $O\left(\frac{n}{\epsilon^2} \log\left(\frac{n}{\delta\epsilon}\right)\right)$ training examples.



Proof Sketch. The proof proceeds as in Theorem 10. The main difference is that in the new feature space the best linear separator has error (fraction of mistakes) $\eta$. It is well known that even in the agnostic case the number of samples needed to learn a separator of error at most $\eta + \epsilon$ is $O\left(\frac{n}{\epsilon^2} \log\left(\frac{n}{\delta\epsilon}\right)\right)$ (see Theorem 26 in Appendix A.2). However, it is NP-hard to minimize the number of mistakes, even approximately [37], so the resulting procedure uses a polynomial number of samples, but it is computationally inefficient. ■



5 An Approximate Characterization of Matroid Rank Functions 

We now present an interesting structural result that is an application of the ideas in Section 4.2. The statement is quite surprising: matroid rank functions are very well approximated by univariate, concave functions. The proof is also based on Theorem 6. To motivate the result, consider the following easy construction of submodular functions, which can be found in Lovász's survey [65, pp. 251].

Proposition 1. Let $h : \mathbb{R} \to \mathbb{R}$ be concave. Then $f : 2^{[n]} \to \mathbb{R}$ defined by $f(S) = h(|S|)$ is submodular.
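
Proposition 1 is easy to sanity-check by brute force on a small ground set. The sketch below is only an illustration: the helper `is_submodular`, the choices $h(t) = \sqrt{t}$ and $h(t) = t^2$, and the ground-set size are assumptions made here, and the check uses the standard definition $f(A) + f(B) \ge f(A \cup B) + f(A \cap B)$.

```python
import math
from itertools import combinations

def is_submodular(f, n):
    """Brute-force check of f(A) + f(B) >= f(A | B) + f(A & B) for all A, B."""
    subsets = [frozenset(c) for r in range(n + 1)
               for c in combinations(range(n), r)]
    # A small tolerance guards against floating-point rounding.
    return all(f(A) + f(B) >= f(A | B) + f(A & B) - 1e-12
               for A in subsets for B in subsets)

concave_h = lambda t: math.sqrt(t)   # concave on [0, n]
convex_h = lambda t: t * t           # convex; serves as a negative control

print(is_submodular(lambda S: concave_h(len(S)), 5))
print(is_submodular(lambda S: convex_h(len(S)), 5))
```

The convex control fails already on two disjoint singletons, since $1 + 1 < 4 + 0$.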
Surprisingly, we now show that a partial converse is true. 

Theorem 13. There is an absolute constant $c > 1$ such that the following is true. Let $f : 2^{[n]} \to \mathbb{Z}_+$ be the rank function of a matroid with no loops, i.e., $f(S) \ge 1$ whenever $S \ne \emptyset$. Fix any $\epsilon > 0$, sufficiently small. There exists a concave function $h : [0, n] \to \mathbb{R}$ such that, for every $k \in [n]$, and for a $1 - \epsilon$ fraction of the sets $S \in \binom{[n]}{k}$,

$$h(k)/(c \log(1/\epsilon)) \le f(S) \le c \log(1/\epsilon)\, h(k).$$

The idea behind this theorem is as follows. For $x \in [0, n]$, we define $h(x)$ to be the expected value of $f$ under the product distribution which samples each element independently with probability $x/n$. The value of $f$ under this distribution is tightly concentrated around $h(x)$, by the results of Section 4.2. For any $k \in [n]$, the distribution defining $h(k)$ is very similar to the uniform distribution on sets of size $k$, so $f$ is also tightly concentrated under the latter distribution. So the value of $f$ for most sets of size $k$ is roughly $h(k)$. The concavity of this function $h$ is a consequence of submodularity of $f$.

Henceforth, we will use the following notation. For $p \in [0, 1]$, let $R(p) \subseteq [n]$ denote the random variable obtained by choosing each element of $[n]$ independently with probability $p$. For $k \in [n]$, let $S(k) \subseteq [n]$ denote a set of cardinality $k$ chosen uniformly at random. Define the function $h' : [0, 1] \to \mathbb{R}$ by

$$h'(p) = E[f(R(p))].$$

For any $\tau \in \mathbb{R}$, define the functions $g_\tau : [0, 1] \to \mathbb{R}$ and $g'_\tau : [n] \to \mathbb{R}$ by

$$g_\tau(p) = \Pr[\, f(R(p)) > \tau \,]$$

$$g'_\tau(k) = \Pr[\, f(S(k)) > \tau \,].$$

Finally, let us introduce the notation $X \cong Y$ to denote that random variables $X$ and $Y$ are identically distributed.

Lemma 4. $h'$ is concave.

Proof. One way to prove this is by appealing to the multilinear extension of $f$, which has been of great value in recent work [13]. This is the function $F : [0, 1]^{[n]} \to \mathbb{R}$ defined by $F(y) = E[f(\hat{y})]$, where $\hat{y} \in \{0, 1\}^{[n]}$ is a random variable obtained by independently setting $\hat{y}_i = 1$ with probability $y_i$, and $\hat{y}_i = 0$ otherwise. Then $h'(p) = F(p, \ldots, p)$. It is known [13] that $\frac{\partial^2 F}{\partial y_i \partial y_j} \le 0$ for all $i, j$. By basic calculus, this implies that the second derivative of $h'$ is non-positive, and hence $h'$ is concave. ■

Lemma 5. $g'_\tau$ is a monotone function.

Proof. Fix $k \in [n-1]$ arbitrarily. Pick a set $S = S(k)$. Construct a new set $T$ by adding to $S$ a uniformly chosen element of $[n] \setminus S$. By monotonicity of $f$ we have $f(S) > \tau \implies f(T) > \tau$. Thus $\Pr[f(S) > \tau] \le \Pr[f(T) > \tau]$. Since $T \cong S(k+1)$, this implies that $g'_\tau(k) \le g'_\tau(k+1)$, as required. ■

Lemma 6. $g'_\tau(k) \le 2 \cdot g_\tau(k/n)$, for all $\tau \in \mathbb{R}$ and $k \in [n]$.

Proof. This lemma is reminiscent of a well-known property of the Poisson approximation [68, Theorem 5.10], and the proof is also similar. Let $p = k/n$. Then

$$g_\tau(p) = \Pr[\, f(R(p)) > \tau \,]$$

$$= \sum_{i=0}^{n} \Pr[\, f(R(p)) > \tau \mid |R(p)| = i \,] \cdot \Pr[\, |R(p)| = i \,]$$

$$= \sum_{i=0}^{n} g'_\tau(i) \cdot \Pr[\, |R(p)| = i \,]$$

$$\ge \sum_{i=k}^{n} g'_\tau(k) \cdot \Pr[\, |R(p)| = i \,] \qquad \text{(by Lemma 5)}$$

$$= g'_\tau(k) \cdot \Pr[\, |R(p)| \ge k \,]$$

$$\ge g'_\tau(k)/2,$$

since the mean $k$ of the binomial distribution $B(n, k/n)$ is also a median.
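
The median fact used in the last step, $\Pr[\, |R(k/n)| \ge k \,] \ge 1/2$, can be confirmed exactly for small $n$ with integer arithmetic. This is a sanity check rather than a proof, and the helper name `upper_tail` is an assumption made here.

```python
from math import comb

def upper_tail(n, k):
    """Exact Pr[Bin(n, k/n) >= k], returned as (numerator, denominator).

    Uses Pr[X = i] = C(n, i) * k^i * (n - k)^(n - i) / n^n, so all
    arithmetic stays in integers and there is no rounding error.
    """
    num = sum(comb(n, i) * k**i * (n - k)**(n - i) for i in range(k, n + 1))
    return num, n**n

# Check that the upper tail at the mean is at least 1/2 for all 1 <= k <= n < 25.
ok = all(2 * upper_tail(n, k)[0] >= upper_tail(n, k)[1]
         for n in range(1, 25) for k in range(1, n + 1))
print(ok)
```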
Proof (of Theorem 13). For $x \in [0, n]$, define $h(x) = h'(x/n) = E[f(R(x/n))]$. Fix $k \in [n]$ arbitrarily.

Case 1. Suppose that $h(k) \ge 400 \log(1/\epsilon)$. As argued in Eq. (4.1),

$$\Pr\big[\, f(R(k/n)) < \tfrac{1}{3} h(k) \,\big] \le \epsilon \quad \text{and} \quad \Pr\big[\, f(R(k/n)) > \tfrac{5}{3} h(k) \,\big] \le \epsilon.$$

By Lemma 6, $\Pr[\, f(S(k)) > \tfrac{5}{3} h(k) \,] \le 2\epsilon$. By a symmetric argument, which we omit, one can show that $\Pr[\, f(S(k)) < \tfrac{1}{3} h(k) \,] \le 2\epsilon$. Thus,

$$\Pr\big[\, \tfrac{1}{3} h(k) \le f(S(k)) \le \tfrac{5}{3} h(k) \,\big] \ge 1 - 4\epsilon.$$

This completes the proof of Case 1.



Case 2. Suppose that $h(k) < 400 \log(1/\epsilon)$. This immediately implies that

$$\Pr\left[ f(S(k)) < \frac{h(k)}{400 \log(1/\epsilon)} \right] \le \Pr[\, f(S(k)) < 1 \,] = 0, \qquad (5.1)$$

since $k \ge 1$, and since we assume that $f(S) \ge 1$ whenever $S \ne \emptyset$. These same assumptions lead to the following lower bound on $h$:

$$h(k) \ge \Pr[\, f(R(k/n)) \ge 1 \,] = \Pr[\, R(k/n) \ne \emptyset \,] \ge 1 - 1/e. \qquad (5.2)$$

Thus

$$\Pr[\, f(S(k)) > (2000 \log(1/\epsilon))\, h(k) \,]$$

$$\le 2 \cdot \Pr[\, f(R(k/n)) > (2000 \log(1/\epsilon))\, h(k) \,] \qquad \text{(by Lemma 6)}$$

$$\le 2 \cdot \Pr[\, f(R(k/n)) > 1200 \log(1/\epsilon) \,] \qquad \text{(by Eq. (5.2))}$$

$$\le 2\epsilon,$$

which can be proven using the concentration result in Corollary 1. Thus,

$$\Pr\left[ \frac{h(k)}{400 \log(1/\epsilon)} \le f(S(k)) \le (2000 \log(1/\epsilon))\, h(k) \right] \ge 1 - 2\epsilon,$$

completing the proof of Case 2.


6 Implications of our Matroid Construction for Submodular Optimization 

The motivation of our matroid construction in Section 4.3 is to show hardness of learning in the PMAC model. Our construction has implications beyond learning theory; it reveals interesting structure of matroids and submodular functions. We now illustrate this interesting structure by using it to show strong inapproximability results for several submodular optimization problems.

6.1 Submodular Minimization under a Cardinality Constraint 

Minimizing a submodular function is a fundamental problem in combinatorial optimization. Formally, the problem is

$$\min\{\, f(S) : S \subseteq [n] \,\}. \qquad (6.1)$$

There exist efficient algorithms to solve this problem exactly [34, 43, 75].

Theorem 14. Let $f : 2^{[n]} \to \mathbb{R}$ be any submodular function.

(a) There is an algorithm with running time $\mathrm{poly}(n)$ that computes the minimum value of (6.1).

(b) There is an algorithm with running time $\mathrm{poly}(n)$ that constructs a lattice which represents all minimizers of (6.1). This lattice can be represented in space $\mathrm{poly}(n)$.

The survey of McCormick [66, Section 5.1] contains further discussion about algorithms to construct the lattice of minimizers. This lattice efficiently encodes a lot of information about the minimizers. For example, given any set $S \subseteq [n]$, one can use the lattice to efficiently determine whether $S$ is a minimizer of (6.1). Also, the lattice can be used to efficiently find the inclusionwise-minimal and inclusionwise-maximal minimizer of (6.1). In summary, submodular function minimization is a very tractable optimization problem, and its minimizers have a rich combinatorial structure.

The submodular function minimization problem becomes much harder when we impose some simple constraints. In this section we consider submodular function minimization under a cardinality constraint:

$$\min\{\, f(S) : S \subseteq [n],\ |S| \ge d \,\}. \qquad (6.2)$$

This problem, which was considered in previous work [79], is a minimization variant of submodular function maximization under a cardinality constraint [30], and is a submodular analog of the minimum coverage problem [84]. Unfortunately, (6.2) is not a tractable optimization problem. We show that, in a strong sense, its minimizers are very unstructured.

The main result of this section is that the minimizers of (6.2) do not have a succinct, approximate representation.

Theorem 15. There exists a randomly chosen non-negative, monotone, submodular function $f : 2^{[n]} \to \mathbb{R}$ such that, for any algorithm that performs any number of queries to $f$ and outputs a data structure of size $\mathrm{poly}(n)$, that data structure cannot represent the minimizers of (6.2) to within an approximation factor $o(n^{1/3}/\log n)$. Moreover, any algorithm that performs $\mathrm{poly}(n)$ queries to $f$ cannot compute the minimum value of (6.2) to within a $o(n^{1/3}/\log n)$ factor.

Here, a "data structure representing the minimizers to within a factor $\alpha$" is a program of size $\mathrm{poly}(n)$ that, given a set $S$, returns "yes" if $S$ is a minimizer, returns "no" if $f(S)$ is at least $\alpha$ times larger than the minimum, and otherwise can return anything.

Previous work [33, 79, 32] showed that there exists a randomly chosen non-negative, monotone, submodular
function $f : 2^{[n]} \to \mathbb{R}$ such that any algorithm that performs poly(n) queries to $f$ cannot approximate
the minimum value of (6.2) to within a $o(n^{1/2}/\log n)$ factor. Also, implicit in the work of Jensen and Korte
[48, pp. 186] is the fact that no data structure of size poly(n) can exactly represent the minimizers of (6.2).
In contrast, Theorem 15 is much stronger because it implies that no data structure of size poly(n) can even
approximately represent the minimizers of (6.2).



To prove Theorem 15 we require the matroid construction of Section 3.1.4 which we restate as follows. 

Theorem 16. Let $n$ be a sufficiently large integer and let $h(n)$ be any slowly divergent function. Define
$k = n^{h(n)} + 1$, $d = n^{1/3}$, $b = 8 \log k$ and $r = d / 4 \log k$.

Set $U = \{u_1, \ldots, u_k\}$ and $V = \{v_1, \ldots, v_n\}$. Suppose that $H = (U \cup V, E)$ is a $(d, L, \epsilon)$-lossless
expander. We construct a family $\mathcal{A} = \{A_1, \ldots, A_k\}$ of subsets of $[n]$, each of size $d$, by setting

$A_i = \{ j \in [n] : v_j \in \Gamma(\{u_i\}) \} \quad \forall i = 1, \ldots, k.$ (6.3)

As before, $A(J)$ denotes $\bigcup_{i \in J} A_i$.

For every $B \subseteq U$ there is a matroid $M_B = ([n], \mathcal{I})$ whose rank function satisfies

$\mathrm{rank}_{M_B}(A_i) = \begin{cases} b & (\text{if } u_i \in B) \\ d & (\text{if } u_i \in U \setminus B). \end{cases}$

Furthermore, every set $S \subseteq [n]$ with $|S| \ge b$ has $\mathrm{rank}_{M_B}(S) \ge b$.

Proof (of Theorem 15). Pick a subset $B \subseteq U \setminus \{u_k\}$ randomly. We now define a submodular function
on the ground set $[n]$. Set $L = d / 2 \log k$ and $\epsilon = 1/L$. We apply Theorem 5 to obtain a random bipartite
multigraph $H$. With probability at least $1 - 2/k$, the resulting graph $H$ is a $(d, L, \epsilon)$-lossless expander, in
which case we can apply Theorem 16 to obtain the matroid $M_B$, which we emphasize does not depend on
$\Gamma(\{u_k\})$. Define $A_i$ as in (6.3) for $i = 1, \ldots, k - 1$.

Suppose that we now allow ACQ to perform any number of queries to $f$. Since $B$ is a random subset of
$U \setminus \{u_k\}$, which has cardinality $n^{h(n)}$, the probability that $B$ can be represented in poly(n) bits is $o(1)$. If
$B$ cannot be exactly represented by ACQ then, with probability 1/2, there is some set $A_i$ whose value is not
correctly represented. The multiplicative error in the value of $A_i$ is $d/b = o(n^{1/3}/\log n)$.

Next we will argue that any algorithm ACQ performing $m = \mathrm{poly}(n)$ queries to $f = \mathrm{rank}_{M_B}$ has low
probability of determining whether $B = \emptyset$. If $B = \emptyset$ then the minimum value of (6.2) is $d = n^{1/3}$, whereas
if $B \ne \emptyset$ then the minimum value of (6.2) is $b = O(h(n) \log n)$. Therefore this will establish the second
part of the theorem.

Suppose the algorithm ACQ queries the value of $f$ on the sets $S_1, \ldots, S_m \subseteq [n]$. Consider the $i$-th
query and suppose inductively that $\mathrm{rank}_{M_B}(S_j) = \mathrm{rank}_{M_\emptyset}(S_j)$ for all $j < i$. Thus ACQ has not yet
distinguished between the cases $f = \mathrm{rank}_{M_B}$ and $f = \mathrm{rank}_{M_\emptyset}$. Consequently the set $S_i$ used in the $i$-th
query is independent of $A_1, \ldots, A_{k-1}$.

Let $S'_i$ be a set of size $|S'_i| = d$ obtained from $S_i$ by either adding (if $|S_i| < d$) or removing (if $|S_i| > d$)
arbitrary elements of $[n]$, or setting $S'_i = S_i$ if $|S_i| = d$. We will apply Theorem 5 again, but this time
we make an additional observation. Since the definition of expansion does not depend on the labeling of
the ground set, one may assume in Theorem 5 that one vertex in $U$, say $u_k$, chooses its neighbors
deterministically and that all remaining vertices in $U$ choose their neighbors at random. Specifically, we will
set

$\Gamma(\{u_k\}) = \{ v_j : j \in S'_i \}.$

The neighbors $\Gamma(\{u_j\})$ for $j < k$ are not randomly rechosen; they are chosen to be the same as they were
in the first invocation of Theorem 5. With probability at least $1 - 2/k$ we again obtain a $(d, L, \epsilon)$-lossless
expander, in which case Theorem 16 shows that $\mathrm{rank}_{M_B}(S'_i) = d = |S'_i|$. That event implies

$\mathrm{rank}_{M_B}(S_i) = \begin{cases} |S_i| = \mathrm{rank}_{M_\emptyset}(S_i) & (\text{if } |S_i| \le d) \\ d = \mathrm{rank}_{M_\emptyset}(S_i) & (\text{if } |S_i| > d), \end{cases}$

and hence the inductive hypothesis holds for $i$ as well.

By a union bound over all $m$ queries, the probability of distinguishing whether $B = \emptyset$ is at most $2m/k = o(1)$. ■

6.2 Submodular s-t Min Cut 

Let $G$ be an undirected graph with edge set $E$ and $n = |E|$. Let $s$ and $t$ be distinct vertices of $G$. A set
$C \subseteq E$ is called an s-t cut if every s-t path intersects $C$. Let $\mathcal{C} \subseteq 2^E$ be the collection of all s-t cuts. The
submodular s-t min cut problem [47] is

$\min \{ f(C) : C \in \mathcal{C} \}$, (6.4)

where $f : 2^E \to \mathbb{R}$ is a non-negative, monotone, submodular function.

Theorem 17 (Jegelka and Bilmes [47]). Any algorithm for the submodular s-t min cut problem with
approximation ratio $o(n^{1/3})$ must perform exponentially many queries to $f$.



Modifying their result to incorporate our matroid construction in Section 4.3 we obtain the following 
theorem. 

Theorem 18. Let $d = n^{1/3}$. Let $G$ be a graph with edge set $E$ consisting of $d$ internally-vertex-disjoint
s-t paths, each of length exactly $n/d$. Assume that $f : 2^E \to \mathbb{R}$ is a non-negative, monotone, submodular
function. For any algorithm that performs any number of queries to $f$ and outputs a data structure of size
poly(n), that data structure cannot represent the minimizers of (6.4) to within an approximation factor
$o(n^{1/3}/\log n)$. Moreover, any algorithm that performs poly(n) queries to $f$ cannot compute the minimum
value of (6.4) to within a $o(n^{1/3}/\log n)$ factor.

The proof of this theorem is almost identical to the proof of Theorem 15. All that we require is a slightly
different expander construction.

Theorem 19. Let $U = \{u_1, \ldots, u_k\}$ and $V$ be disjoint vertex sets, where $|V| = n$ and $n$ is a multiple of $d$.
Write $V$ as the disjoint union $V = V_1 \cup \cdots \cup V_d$ where each $|V_i| = n/d$.

Generate a random bipartite multigraph $H$ with left-vertices $U$ and right-vertices $V$ as follows. The
vertex $u_k$ has exactly $d$ neighbors in $V$, chosen deterministically and arbitrarily. For each vertex $u_\ell$ with
$\ell \le k - 1$, pick exactly one neighbor from each $V_i$, uniformly and independently at random. So each vertex
in $U$ has degree exactly $d$.

Suppose that $k \ge 4$, $L \ge d$, $d \ge \log(k)/\epsilon$ and $n \ge 22Ld/\epsilon$. Then, with probability at least $1 - 2/k$, the
multigraph $H$ has no parallel edges and satisfies

$|\Gamma(\{u\})| = d \quad \forall u \in U$
$|\Gamma(J)| \ge (1 - \epsilon) \cdot d \cdot |J| \quad \forall J \subseteq U,\ |J| \le L.$

Proof. The proof is nearly identical to the proof of Theorem 5 in Appendix D. The only difference is in
analyzing the probability of a repeat when sampling the neighbors of a set $J \subseteq U$ with $|J| = j$. First
consider the case that $u_k \in J$. When sampling the neighbors $\Gamma(J)$, an element $v_i$ is considered a repeat if
$v_i \in \{v_1, \ldots, v_{i-1}\}$ or if $v_i \in \Gamma(\{u_k\})$. Conditioned on $v_1, \ldots, v_{i-1}$, the probability of a repeat is at most
^jj. link J then this probability is at most jd/n. Consequently, the probability of having more than ejd 
repeats is at most 

The last inequality follows from $j + d \le 2L$ and our hypothesis $n \ge 22Ld/\epsilon$. The remainder of the proof
is identical to the proof of Theorem 5. ■



Proof Sketch (of Theorem 18). Let $V_i$ be the edges of the $i$-th s-t path. The minimal s-t cuts are those
which choose exactly one edge from each s-t path; in other words, they are the transversals of the $V_i$'s. Let
$V = V_1 \cup \cdots \cup V_d$; this is also the edge set of the graph $G$.
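
The transversal structure of the minimal s-t cuts can be illustrated on a toy instance. The following is a minimal sketch, assuming a graph made of `d` internally-vertex-disjoint s-t paths whose edges are labelled `(path index, position)`; the helper names are hypothetical and are not taken from the paper.

```python
from itertools import product

def path_edges(d, length):
    """Edge sets V_1..V_d: path i has `length` edges, labelled (i, position)."""
    return [[(i, pos) for pos in range(length)] for i in range(d)]

def minimal_st_cuts(paths):
    """Each minimal s-t cut picks exactly one edge from every path (a transversal)."""
    return [frozenset(choice) for choice in product(*paths)]

def is_st_cut(cut, paths):
    """With disjoint paths, an edge set is an s-t cut iff it hits every path."""
    return all(any(edge in cut for edge in path) for path in paths)

# Toy instance (an assumption for illustration): d = 2 paths, 3 edges each.
paths = path_edges(d=2, length=3)
cuts = minimal_st_cuts(paths)
```

In this toy instance there are $(n/d)^d = 3^2$ transversals, and removing any edge from one of them leaves some path uncut.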



As in Theorem 15 we apply Theorem 19 and Theorem 16 to obtain a matroid $M_B$. Because the expander
construction of Theorem 19 ensures that each vertex $u_\ell$ has exactly one neighbor in each $V_i$, the
corresponding set $A_\ell$ is a minimal s-t cut.

Suppose ACQ performs any number of queries to $f = \mathrm{rank}_{M_B}$. The set $B$ has low probability of
being representable in poly(n) bits, in which case there is an s-t min cut $A_i$ whose value is not correctly
represented with probability 1/2. The multiplicative error in the value of $A_i$ is $d/b = o(n^{1/3}/\log n)$. This
proves the first part of the theorem.

Similarly, any algorithm ACQ performing $m = \mathrm{poly}(n)$ queries to $f$ has low probability of determining
whether $B = \emptyset$. If $B = \emptyset$ then the minimum value of (6.4) is $d = n^{1/3}$, whereas if $B \ne \emptyset$ then the minimum
value of (6.4) is $b = O(h(n) \log n)$. This proves the second part of the theorem.



6.3 Submodular Vertex Cover 

Let $G = (V, E)$ be a graph with $n = |V|$. A set $C \subseteq V$ is a vertex cover if every edge has at least one
endpoint in $C$. Let $\mathcal{C} \subseteq 2^V$ be the collection of vertex covers in the graph. The submodular vertex cover
problem [31, 44] is

$\min \{ f(S) : S \in \mathcal{C} \}$, (6.5)

where $f : 2^V \to \mathbb{R}$ is a non-negative, submodular function. An algorithm for this problem is said to have
approximation ratio $\alpha$ if, for any function $f$, it returns a set $S$ for which $f(S) \le \alpha \cdot \min \{ f(S) : S \in \mathcal{C} \}$.

Theorem 20 (Goel et al. [31], Iwata and Nagano [44]). There is an algorithm which performs poly(n)
queries to $f$ and has approximation ratio 2.

Goel et al. only state that their algorithm is applicable for monotone, submodular functions, but the
monotonicity restriction seems to be unnecessary.

Theorem 21 (Goel et al. [31]). For any constant $\epsilon > 0$, any algorithm for the submodular vertex cover
problem with approximation ratio $2 - \epsilon$ must perform exponentially many queries to $f$.

Modifying their result to incorporate our matroid construction in Section 4.3 we obtain the following 
theorem. 

Theorem 22. Let $G = (U \cup V, E)$ be a bipartite graph. Assume that $f : 2^{U \cup V} \to \mathbb{R}$ is a non-negative,
monotone, submodular function. Let $\epsilon \in (0, 1/3)$ be a constant. For any algorithm that performs any
number of queries to $f$ and outputs a data structure of size poly(n), that data structure cannot represent
the minimizers of (6.5) to within an approximation factor better than $4/3 - \epsilon$. Moreover, any algorithm that
performs poly(n) queries to $f$ cannot compute the minimum value of (6.5) to within a $4/3 - \epsilon$ factor.



Proof Sketch. Let $G$ be a graph such that $|U| = |V| = |E| = n/2$, and where the edges in $E$ form
a matching between $U$ and $V$. The minimal vertex covers are those that contain exactly one endpoint of
each edge in E. Set k = 2 e n l m . Let A = {A\, ■ ■ ■ , A^} be a collection of independently and uniformly 
chosen minimal vertex covers. For any $i \ne j$, $\mathrm{E}[\,|A_i \cap A_j|\,] = n/4$ and a Chernoff bound shows that
$\Pr[\,|A_i \cap A_j| > (1+\epsilon)n/4\,] \le \exp(-\epsilon^2 n/12)$. A union bound shows that, with high probability,
$|A_i \cap A_j| \le (1+\epsilon)n/4$ for all $i \ne j$.

We now apply Lemma 8 with each $b_i = b = (3+\epsilon)n/8$ and $d = n/2$. We have

$\min_{i,j \in [k],\, i \ne j} (b_i + b_j - |A_i \cap A_j|) \ge 2b - (1+\epsilon)n/4 = 2(3+\epsilon)n/8 - (1+\epsilon)n/4 = n/2,$

and therefore the hypotheses of Lemma 8 are satisfied. It follows that, for any set $B \subseteq \mathcal{A}$, the set

$\mathcal{I}_B = \{ I : |I| \le d \ \wedge\ |I \cap A_j| \le b \ \ \forall A_j \in B \}$

is the family of independent sets of a matroid. Let $f = \mathrm{rank}_{M_B}$ be the rank function of this matroid.
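
To make the family $\mathcal{I}_B$ concrete, the following is a minimal sketch that builds it by brute force on a toy ground set and checks the matroid axioms directly. The ground set, the two sets `A1` and `A2`, and the values `b = 2`, `d = 3` are assumptions chosen only so that $b + b - |A_1 \cap A_2| \ge d$ holds; they are not the parameters used in the proof.

```python
from itertools import combinations

def all_subsets(ground):
    items = sorted(ground)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def independent_sets(ground, family_b, b, d):
    """I_B = { I : |I| <= d and |I ∩ A_j| <= b for every A_j in B }."""
    return {I for I in all_subsets(ground)
            if len(I) <= d and all(len(I & A) <= b for A in family_b)}

def is_matroid(indep):
    """Brute-force check of the independence axioms (toy sizes only)."""
    if frozenset() not in indep:
        return False
    for I in indep:
        # Hereditary property: checking single-element removals suffices.
        if any(I - {e} not in indep for e in I):
            return False
        # Exchange property.
        for J in indep:
            if len(I) < len(J) and not any(I | {e} in indep for e in J - I):
                return False
    return True

def rank(indep, S):
    """Size of a largest independent subset of S."""
    return max(len(I) for I in indep if I <= S)

ground = frozenset(range(6))
A1, A2 = frozenset({0, 1, 2}), frozenset({3, 4, 5})
b, d = 2, 3  # hypothetical toy values: b + b - |A1 ∩ A2| = 4 >= d
indep_B = independent_sets(ground, [A1, A2], b, d)
indep_empty = independent_sets(ground, [], b, d)
```

In this toy instance the set `A1` has rank `b` when it belongs to $B$ and rank `d` when $B = \emptyset$, which is the kind of gap the proof sketch exploits.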

Suppose ACQ performs any number of queries to $f$. The set $B$ has low probability of being representable
in poly(n) bits, in which case there is a minimal vertex cover $A_i$ whose value is not correctly represented
with probability 1/2. The multiplicative error in the value of $A_i$ is

$\frac{d}{b} = \frac{n/2}{(3+\epsilon)n/8} = \frac{4}{3+\epsilon} > \frac{4}{3} - \epsilon.$

This proves the first part of the theorem.

Similarly, any algorithm ACQ performing $m = \mathrm{poly}(n)$ queries to $f$ has low probability of determining
whether $B = \emptyset$. If $B = \emptyset$ then the minimum value of (6.5) is $d$, whereas if $B \ne \emptyset$ then the minimum value
of (6.5) is $b$. The multiplicative error is at least $d/b$, proving the second part of the theorem.



7 Implications to Algorithmic Game Theory and Economics 

An important consequence of our matroid construction in Section [34] is that matroid rank functions do not 
have a "sketch", i.e., a concise, approximate representation. As matroid rank functions can be shown to
satisfy the "gross substitutes" property [70], our work implies that gross substitutes functions do not have a
concise, approximate representation. This provides a surprising answer to an open question in economics [9,
10]. In this section we define gross substitutes functions, briefly describe their importance in economics,
and formally state the implications of our results for these functions.

Gross substitutes functions play a central role in algorithmic game theory and economics, particularly
through their use as valuation functions in combinatorial auctions [18, 35, 72]. Intuitively, in a gross
substitutes valuation, increasing the price of certain items cannot reduce the demand for items whose price has
not changed. Formally:

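Definition 4. For price vector $p \in \mathbb{R}^n$, the demand correspondence $\mathcal{D}_f(p)$ of valuation $f$ is the collection
of preferred sets at prices $p$, i.e.,

$\mathcal{D}_f(p) = \operatorname{argmax}_{S \subseteq [n]} \Big\{ f(S) - \sum_{j \in S} p_j \Big\}.$

A function $f$ is gross substitutes (GS) if for any price vector $q \ge p$ (i.e., for which $q_i \ge p_i$ $\forall i \in [n]$), and any
$A \in \mathcal{D}_f(p)$ there exists $A' \in \mathcal{D}_f(q)$ with $A' \supseteq \{ i \in A : p_i = q_i \}$.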

In other words, the gross substitutes property requires that all items i in some preferred set A at the old 
prices p and for which the old and new prices are equal (pi = qi) are simultaneously contained in some 
preferred set A' at the new prices q. 
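
The definition can be explored by brute force on very small ground sets. The following is a minimal sketch, assuming valuations given as Python functions on frozensets and exact rational prices; it checks the gross substitutes condition for a single pair of price vectors only, so a `True` result is evidence for one pair and not a proof of the property. The two example valuations are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def all_subsets(n):
    return [frozenset(c) for r in range(n + 1) for c in combinations(range(n), r)]

def demand(f, n, prices):
    """Sets S maximizing the utility f(S) - (sum of prices over S)."""
    utility = {S: f(S) - sum(prices[j] for j in S) for S in all_subsets(n)}
    best = max(utility.values())
    return {S for S, u in utility.items() if u == best}

def gs_holds_for_pair(f, n, p, q):
    """Check the gross substitutes condition for one pair of prices with q >= p."""
    assert all(q[i] >= p[i] for i in range(n))
    demand_q = demand(f, n, q)
    for A in demand(f, n, p):
        unchanged = {i for i in A if p[i] == q[i]}
        if not any(unchanged <= A_new for A_new in demand_q):
            return False
    return True

def unit_demand(S):
    """Rank function of a rank-1 uniform matroid."""
    return Fraction(min(len(S), 1))

def complements(S):
    """Two items that are only valuable together (not submodular)."""
    return Fraction(2 if len(S) == 2 else 0)

p = (Fraction(1, 5), Fraction(1, 2))
q = (Fraction(3, 5), Fraction(1, 2))
p2 = (Fraction(1, 2), Fraction(1, 2))
q2 = (Fraction(2), Fraction(1, 2))
```

For the complements valuation, raising the price of item 0 from `p2` to `q2` removes item 1 from every preferred set even though its price is unchanged, so the condition fails for that pair.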

Gross substitutes valuations (introduced by Kelso and Crawford [55]) enjoy several appealing structural
properties whose implications have been extensively studied by many researchers [9]. For example, given bidders
with gross substitutes valuations, simple item-price ascending auctions can be used for determining the
socially-efficient allocation. As another example, the gross substitutes condition is necessary for important
economic conclusions: Gul and Stacchetti [35] and Milgrom [67] showed that given any
valuation that is not gross substitutes, one can specify very simple valuations for the other agents to create
an economy in which no Walrasian equilibrium exists.

One important unsolved question concerns the complexity of describing gross substitutes valuations.
Several researchers have asked whether there exists a "succinct" representation for such valuations. In other
words, can a bidder disclose the exact details of his valuation without conveying an exceptionally large
amount of information? An implication of our work is that the answer to this question is "no", in a very
strong sense. Our work implies that gross substitutes functions cannot be represented succinctly, even
approximately, and even with a large approximation factor. Formally:

Definition 5. We say that $g : 2^{[n]} \to \mathbb{R}_+$ is an $\alpha$-sketch for $f : 2^{[n]} \to \mathbb{R}_+$ if $g$ can be represented in
poly(n) space and for every set $S$ we have that $f(S)/\alpha \le g(S) \le f(S)$.
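
The approximation half of this definition is easy to test by enumeration on a small ground set. The following is a minimal sketch under that assumption; it does not address the poly(n) space requirement, and the two example functions are hypothetical.

```python
from itertools import combinations

def all_subsets(n):
    return [frozenset(c) for r in range(n + 1) for c in combinations(range(n), r)]

def is_alpha_sketch(f, g, n, alpha):
    """Check f(S)/alpha <= g(S) <= f(S) for every S ⊆ [n] by enumeration."""
    return all(f(S) / alpha <= g(S) <= f(S) for S in all_subsets(n))

def rank_uniform(S):
    """Rank function of the uniform matroid of rank 2."""
    return min(len(S), 2)

def nonempty_indicator(S):
    """Candidate sketch: 1 on non-empty sets, 0 on the empty set."""
    return min(len(S), 1)
```

Here `nonempty_indicator` satisfies the inequality for `rank_uniform` on four elements with $\alpha = 2$, but not with $\alpha = 1.5$.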

As matroid rank functions can be shown to satisfy the gross substitutes property [70], our work implies
that gross substitutes functions do not have a concise, approximate representation. Specifically:

Theorem 23. Gross substitutes functions do not admit $o(n^{1/3}/\log n)$-sketches.

8 Conclusions 

In this work we have used a learning theory perspective to uncover new structural properties of submodular 
functions. We have presented the first algorithms and lower bounds for learning submodular functions in a 
distributional learning setting. We also presented numerous implications of our work in algorithmic game 
theory, economics, matroid theory and combinatorial optimization. 

Regarding learnability, we presented polynomial upper and lower bounds on the approximation factor 
achievable when using only a polynomial number of examples drawn i.i.d. from an arbitrary distribution. 
We also presented a simple algorithm achieving a constant-factor approximation under product distributions. 
These results show that, with respect to product distributions, submodular functions behave in a fairly simple 
manner, whereas with respect to general distributions, submodular functions behave in a much more complex 
manner. 

We constructed a new family of matroids with interesting technical properties in order to prove our lower 
bound on PMAC-learnability. The existence of these matroids also resolves an open question in economics: 
an immediate corollary of our construction is that gross substitutes functions have no succinct, approximate
representation. We also used these matroids to show that the optimal solutions of various submodular 
optimization problems can have a very complicated structure. 

The PMAC model provides a new approach for analyzing the learnability of real-valued functions. This
paper has analyzed submodular functions in the PMAC model. We believe that it will be interesting to study
PMAC-learnability of other classes of real-valued functions. Indeed, as discussed below, subsequent work
has already studied subadditive and XOS functions in the PMAC model.

One technical question left open by this work is determining the precise approximation factor achievable
for PMAC-learning submodular functions: there is a gap between the $O(n^{1/2})$ upper bound in Theorem 10
and the $\tilde{\Omega}(n^{1/3})$ lower bound in Theorem 9. We suspect that the lower bound can be improved to $\tilde{\Omega}(n^{1/2})$.
If such an improved lower bound is possible, the matroids or submodular functions used in its proof are 
likely to be very interesting. 

8.1 Subsequent Work 

Following our work, Balcan et al. Q and Badanidiyuru et al. have provided further learnability results 
in the PMAC model for various classes of set functions commonly used in algorithmic game theory and 
economics. Building on our algorithmic technique, Balcan et al. give a computationally efficient
algorithm for PMAC-learning subadditive functions to within a $O(\sqrt{n})$ factor. They also give a target-dependent
learnability result for XOS (or fractionally subadditive) functions. Their algorithms use the algorithmic
technique that we develop in Section 4.4 together with new structural results for these classes of functions.
Badanidiyuru et al. [5] consider the problem of sketching subadditive and submodular functions. They show
that the existence of an $\alpha$-sketch implies that PMAC-learning to within a factor $\alpha$ is possible if
computational efficiency is ignored. As a consequence they obtain (computationally inefficient) algorithms for
PMAC-learning to within a $O(\sqrt{n})$ factor for subadditive functions, and to within a $1 + \epsilon$ factor for both
coverage functions and OXS functions.

Regarding inapproximability, both Badanidiyuru et al. and Balcan et al. show that XOS (i.e., fractionally
subadditive) functions do not have sketches that approximate to within a factor $\tilde{o}(\sqrt{n})$. Consequently, every
algorithm for PMAC-learning XOS functions must have approximation factor $\tilde{\Omega}(\sqrt{n})$. The construction
used to prove this result is significantly simpler than our construction in Section 4.3 because XOS functions
are a more expressive class than submodular functions.

Motivated by problems in privacy preserving data analysis, Gupta et al. [36] considered how to perform
statistical queries to a data set in order to learn the answers to all statistical queries from a certain class.
They showed that this problem can be efficiently solved when the queries are described by a submodular
function. One of the technical pieces in their work is an algorithm to learn submodular functions under a
product distribution. A main building block of their technique is the algorithm we provide in Section 4.2
for learning under a product distribution, and their analysis is inspired by ours. Their formal guarantee is
incomparable to ours, however: it is stronger in that they allow non-Lipschitz and non-monotone functions,
but it is weaker in that they require access to the submodular function via a value oracle, and they guarantee
only additive error (assuming the function is appropriately normalized). Moreover, their running time is
$n^{\mathrm{poly}(1/\epsilon)}$ whereas ours is $\mathrm{poly}(n, 1/\epsilon)$.

Cheraghchi et al. ifTTl study the noise stability of submodular functions. As a consequence they obtain 
an algorithm for learning a submodular function under product distributions. Their algorithm also works
for non-submodular and non-Lipschitz functions, and only requires access to the submodular function via
statistical queries, though the running time is $n^{\mathrm{poly}(1/\epsilon)}$. Their algorithm is agnostic (meaning that they do
not assume the target function is submodular), and their performance guarantee proves that the $L_1$ loss of
their hypothesis is at most $\epsilon$ more than the best error achieved by any submodular function (assuming the
function is appropriately normalized).

A Standard Facts

A.1 Submodular Functions

Theorem 24. Given a finite universe $U$, let $S_1, S_2, \ldots, S_n$ be subsets of $U$. Define $f : 2^{[n]} \to \mathbb{R}_+$ by

$f(A) = \big| \bigcup_{i \in A} S_i \big|$ for $A \subseteq [n]$.

Then $f$ is monotone and submodular. More generally, for any non-negative weight function $w : U \to \mathbb{R}_+$,
the function $f$ defined by

$f(A) = w\big( \bigcup_{i \in A} S_i \big)$ for $A \subseteq [n]$

is monotone and submodular.
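
As an illustration of this standard fact, the following is a minimal sketch that builds a (weighted) coverage function and verifies monotonicity and submodularity by enumerating all pairs of subsets. The example sets and integer weights are hypothetical, and the check is exponential in $n$, so it is only meant for toy sizes.

```python
from itertools import combinations

def all_subsets(n):
    return [frozenset(c) for r in range(n + 1) for c in combinations(range(n), r)]

def coverage(sets, weight=None):
    """Return f(A) = |union of sets[i], i in A|, or the weighted version w(union)."""
    def f(A):
        covered = set().union(*(sets[i] for i in A))
        if weight is None:
            return len(covered)
        return sum(weight[u] for u in covered)
    return f

def is_monotone(f, n):
    subsets = all_subsets(n)
    return all(f(A) <= f(B) for A in subsets for B in subsets if A <= B)

def is_submodular(f, n):
    """Check f(A) + f(B) >= f(A ∪ B) + f(A ∩ B) for all pairs."""
    subsets = all_subsets(n)
    return all(f(A) + f(B) >= f(A | B) + f(A & B) for A in subsets for B in subsets)

sets = [{"a", "b"}, {"b", "c"}, {"c", "d", "e"}]
weights = {"a": 3, "b": 1, "c": 4, "d": 1, "e": 5}
f_plain = coverage(sets)
f_weighted = coverage(sets, weights)
```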

Lemma 7. The minimizers of any submodular function are closed under union and intersection.

Proof. Assume that $J_1$ and $J_2$ are minimizers for $f$. By submodularity we have

$f(J_1) + f(J_2) \ge f(J_1 \cap J_2) + f(J_1 \cup J_2).$

Since $J_1$ and $J_2$ are minimizers, we also have

$f(J_1 \cap J_2) + f(J_1 \cup J_2) \ge f(J_1) + f(J_2),$

so $f(J_1) = f(J_2) = f(J_1 \cap J_2) = f(J_1 \cup J_2)$, as desired. ■
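
The lattice structure of the minimizers can be observed on a small example. The following is a minimal sketch, assuming the cut function of an undirected graph (a standard submodular function) on a hypothetical graph with two disjoint edges; the minimizers are found by enumeration.

```python
from itertools import combinations

def all_subsets(n):
    return [frozenset(c) for r in range(n + 1) for c in combinations(range(n), r)]

def cut_function(edges):
    """f(S) = number of edges with exactly one endpoint in S."""
    def f(S):
        return sum(1 for u, v in edges if (u in S) != (v in S))
    return f

def minimizers(f, n):
    subsets = all_subsets(n)
    best = min(f(S) for S in subsets)
    return {S for S in subsets if f(S) == best}

def closed_under_union_and_intersection(family):
    return all((A | B) in family and (A & B) in family
               for A in family for B in family)

# Hypothetical example: vertices 0..3 with edges {0,1} and {2,3}.
f = cut_function([(0, 1), (2, 3)])
mins = minimizers(f, 4)
```

In this example the minimizers are the unions of connected components, which form a lattice under union and intersection.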
A.2 Sample Complexity Results 



We state here several known sample complexity bounds that were used for proving the results in Section 4.4.
See, e.g., flUH). 

Theorem 25. Let $C$ be a set of functions from $X$ to $\{-1, 1\}$ with finite VC-dimension $D \ge 1$. Let $\mathcal{D}$ be
an arbitrary, but fixed probability distribution over $X$ and let $c^*$ be an arbitrary target function. For any $\epsilon$,
$\delta > 0$, if we draw a sample $S$ from $\mathcal{D}$ of size

m(e, S,D) = - e UD log f^j + 2 log 

then with probability $1 - \delta$, all hypotheses with error $\ge \epsilon$ are inconsistent with the data; i.e., uniformly for
all $h \in C$ with $\mathrm{err}(h) \ge \epsilon$, we have $\widehat{\mathrm{err}}(h) > 0$. Here $\mathrm{err}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \ne c^*(x)]$ is the true error of
$h$ and $\widehat{\mathrm{err}}(h) = \Pr_{x \sim S}[h(x) \ne c^*(x)]$ is the empirical error of $h$.

Theorem 26. Suppose that $C$ is a set of functions from $X$ to $\{-1, 1\}$ with finite VC-dimension $D \ge 1$. For
any distribution $\mathcal{D}$ over $X$, any target function (not necessarily in $C$), and any $\epsilon, \delta > 0$, if we draw a sample
from $\mathcal{D}$ of size

m(e, 6, D) = ^ (2D In (^) + In (^ 



then with probability at least $1 - \delta$, we have $|\mathrm{err}(h) - \widehat{\mathrm{err}}(h)| \le \epsilon$ for all $h \in C$.

B Proof of Theorem ©

We begin by observing that the theorem is much easier to prove in the special case$^6$ that $f$ is integer-valued.
Together with our other hypotheses on $f$, this implies that $f$ must actually be a matroid rank function.
Whenever $f(S)$ is large, this fact can be "certified" by any maximal independent subset of $S$. The theorem
then follows easily from a version of Talagrand's inequality which leverages this certification property; see,
e.g., 12 §7.7] or flgH §10.1]. 

We now prove the theorem in its full generality. We may assume that $t \le \sqrt{b}$, otherwise the theorem
is trivial as the left-hand side of Eq. (3.9) is zero. Talagrand's inequality states: for any $A \subseteq \{0,1\}^n$ and
$y \in \{0,1\}^n$ drawn from a product distribution,

$\Pr[y \in A] \cdot \Pr[\rho(A, y) \ge t] \le \exp(-t^2/4)$, (B.1)

where $\rho$ is a distance function defined by

$\rho(A, y) = \sup_{\|\alpha\|_2 = 1} \ \min_{z \in A} \ \sum_{i : y_i \ne z_i} \alpha_i.$

We will apply this inequality to the set $A \subseteq 2^V$ defined by $A = \{ X : f(X) \le b - t\sqrt{b} \}$.

Claim 6. For every $Y \subseteq V$, $f(Y) \ge b$ implies $\rho(A, Y) \ge t$.

Proof. Suppose to the contrary that $\rho(A, Y) < t$. By relabeling, we can write $Y$ as $Y = \{1, \ldots, k\}$. For
$i \in \{0, \ldots, k\}$, let $E_i = \{1, \ldots, i\}$. Define

$\alpha_i = \begin{cases} f(E_i) - f(E_{i-1}) & (\text{if } i \in Y) \\ 0 & (\text{otherwise}). \end{cases}$

Since $f$ is monotone and 1-Lipschitz, we have $0 \le \alpha_i \le 1$. Thus $\|\alpha\|_2 \le \sqrt{\sum_i \alpha_i} \le \sqrt{f(Y)}$, by
non-negativity of $f$.

The definition of $\rho$ and our supposition $\rho(A, Y) < t$ imply that there exists $Z \in A$ with

$\sum_{i \in (Y \setminus Z) \cup (Z \setminus Y)} \alpha_i \le \rho(A, Y) \cdot \|\alpha\|_2 < t\sqrt{f(Y)}.$ (B.2)

We may assume that $Z \subseteq Y$, since $Z \cap Y$ also satisfies the desired conditions. This follows since
monotonicity of $f$ implies that $\alpha \ge 0$ and that $A$ is downwards-closed.

We will obtain a contradiction by showing that $f(Y) - f(Z) < t\sqrt{f(Y)}$. First let us order $Y \setminus Z$ as
$(\phi(1), \ldots, \phi(m))$, where $\phi(i) < \phi(j)$ iff $i < j$. Next, define $F_i = Z \cup \{\phi(1), \ldots, \phi(i)\} \subseteq Y$. Note that
£j ^ tms follows from our choice of (j), since Z C F^-i^ but we might have Z % Ej. Therefore 

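$\begin{aligned}
f(Y) - f(Z) &= \sum_{i=1}^{m} \big( f(F_i) - f(F_{i-1}) \big) \\
&\le \sum_{j \in Y \setminus Z} \big( f(E_j) - f(E_{j-1}) \big) && (\text{since } E_{j-1} \subseteq F_{\phi^{-1}(j)-1} \text{ and } f \text{ is submodular}) \\
&= \sum_{j \in Y \setminus Z} \alpha_j \\
&< t\sqrt{f(Y)} && (\text{by Eq. (B.2)}).
\end{aligned}$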



6 An initial draft of our paper proved only this easier case. After learning of the similar concentration inequality by Chekuri et 



al. EH, we extended our proof to handle functions / that are not integer- valued. 
So $f(Z) > f(Y) - t\sqrt{f(Y)} \ge b - t\sqrt{b}$, since $f(Y) \ge b$ and $t \le \sqrt{b}$. This contradicts $Z \in A$. □

This claim implies $\Pr[f(Y) \ge b] \le \Pr[\rho(A, Y) \ge t]$, so the theorem follows from Eq. (B.1).



C Additional Proofs for Learning Submodular Functions 
C.l Learning Boolean Submodular Functions 

Theorem 27. The class of monotone, Boolean-valued, submodular functions is efficiently PMAC-learnable 
with approximation factor 1. 

Proof. Let $f : 2^{[n]} \to \{0, 1\}$ be an arbitrary monotone, Boolean, submodular function. We claim that $f$ is
either constant or a monotone disjunction. If $f(\emptyset) = 1$ then this is trivial, so assume $f(\emptyset) = 0$.

Since submodularity is equivalent to the property of decreasing marginal values, and since $f(\emptyset) = 0$, we
get

$f(T \cup \{x\}) - f(T) \le f(\{x\}) \quad \forall T \subseteq [n],\ x \in [n] \setminus T.$

If $f(\{x\}) = 0$ then this together with monotonicity implies that $f(T \cup \{x\}) = f(T)$ for all $T$. On the other
hand, if $f(\{x\}) = 1$ then monotonicity implies that $f(T) = 1$ for all $T$ such that $x \in T$. Thus we have
argued that $f$ is a disjunction:

$f(S) = \begin{cases} 1 & (\text{if } S \cap X \ne \emptyset) \\ 0 & (\text{otherwise}), \end{cases}$

where $X = \{ x : f(\{x\}) = 1 \}$. This proves the claim.

It is well known that the class of disjunctions is easy to learn in the supervised learning setting [54, 83]. ■

Non-monotone, Boolean, submodular functions need not be disjunctions. For example, consider the
function $f$ where $f(S) = 0$ if $S \in \{\emptyset, [n]\}$ and $f(S) = 1$ otherwise; it is submodular, but not a disjunction.
However, it turns out that any submodular Boolean function is a 2-DNF. This was already known [23],
and it can be proven by case analysis as in Theorem 27. It is well known that 2-DNFs are efficiently
PAC-learnable. We summarize this discussion as follows.

Theorem 28. The class of Boolean-valued, submodular functions is efficiently PMAC-learnable with ap- 
proximation factor 1. 
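
The structural claim behind Theorem 27 can be checked exhaustively for a very small ground set. The following is a minimal sketch, assuming $n = 3$ (a choice made here only to keep the enumeration of all $2^8$ Boolean set functions cheap): it filters the monotone submodular ones and tests that each is constant or a monotone disjunction, and it also examines the non-monotone example from the text.

```python
from itertools import combinations, product

N = 3  # toy ground-set size (assumption for illustration)
SUBSETS = [frozenset(c) for r in range(N + 1) for c in combinations(range(N), r)]

def is_monotone(f):
    return all(f[A] <= f[B] for A in SUBSETS for B in SUBSETS if A <= B)

def is_submodular(f):
    return all(f[A] + f[B] >= f[A | B] + f[A & B] for A in SUBSETS for B in SUBSETS)

def is_constant(f):
    return len(set(f.values())) == 1

def is_disjunction(f):
    """f(S) = 1 iff S intersects X, where X = { x : f({x}) = 1 }."""
    X = {x for x in range(N) if f[frozenset({x})] == 1}
    return all(f[S] == (1 if S & X else 0) for S in SUBSETS)

# Every Boolean-valued set function on [N], stored as a dict from subsets to {0, 1}.
boolean_functions = [dict(zip(SUBSETS, values))
                     for values in product((0, 1), repeat=len(SUBSETS))]
monotone_submodular = [f for f in boolean_functions
                       if is_monotone(f) and is_submodular(f)]

# The non-monotone example from the text: 0 on the empty set and on [N], 1 elsewhere.
example = {S: 0 if len(S) in (0, N) else 1 for S in SUBSETS}
```

For $n = 3$ the enumeration finds the $2^3$ monotone disjunctions (including the constant 0) plus the constant 1 function.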

C.2 Learning under Product Distributions 

Lemma 2. Let $f^* : 2^{[n]} \to \mathbb{R}$ be a non-negative, monotone, submodular, 1-Lipschitz function. Suppose
that $S_1, \ldots, S_\ell$ are drawn from a product distribution $D$ over $2^{[n]}$. Let $\mu$ be the empirical average
$\mu = \sum_{i=1}^{\ell} f^*(S_i)/\ell$, which is our estimate for $\mathrm{E}_{S \sim D}[f^*(S)]$. Let $\epsilon, \delta \le 1/5$. We have:

(1) If $\mathrm{E}[f^*(S)] \ge 500 \log(1/\epsilon)$ and $\ell \ge 12 \log(1/\delta)$ then

$\Pr[\mu \ge 450 \log(1/\epsilon)] \ge 1 - \delta/4.$

(2) If $\mathrm{E}[f^*(S)] \ge 400 \log(1/\epsilon)$ and $\ell \ge 12 \log(1/\delta)$ then

Pr[fE[r(5)]<M<|E[r(5)]] > 1-6/4. 

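(3) If $\mathrm{E}[f^*(S)] \le 500 \log(1/\epsilon)$ and $\ell \ge 12 \log(1/\delta)$ then

$\Pr[f^*(S) < 1200 \log(1/\epsilon)] \ge 1 - \epsilon.$

(4) If $\mathrm{E}[f^*(S)] \le 400 \log(1/\epsilon)$ and $\ell \ge 12 \log(1/\delta)$ then

$\Pr[\mu < 450 \log(1/\epsilon)] \ge 1 - \delta/4.$

Proof. (1): Let $f : 2^{[n] \times [\ell]} \to \mathbb{R}$ be defined by

$f(S_1, \ldots, S_\ell) = \sum_{i=1}^{\ell} f^*(S_i).$

It is easy to check that $f$ is also non-negative, monotone, submodular and 1-Lipschitz. We will apply
Corollary 1 to $f$ with $\alpha = 1/10$. Let $X = (S_1, \ldots, S_\ell)$. Note that $\mathrm{E}[f(X)] \ge 500\ell \ge 240/\alpha$. Then

$\begin{aligned}
\Pr[\mu < 450 \log(1/\epsilon)] &= \Pr\Big[ \sum_{i=1}^{\ell} f^*(S_i) < 450\,\ell \log(1/\epsilon) \Big] \\
&\le \Pr\big[\, |f(X) - \mathrm{E}[f(X)]| > \mathrm{E}[f(X)]/10 \,\big] \\
&\le 4 \exp\big( -\mathrm{E}[f(X)]/1600 \big) \\
&\le 4 \exp(-\ell/4) \le 4\delta^3 \le \delta/4.
\end{aligned}$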

(2): Let / and X be as above. Then 

Pr[{jE[/*(S)] < /i<|E[r(5)]] < Pr[|/i-E[/*(5)]| > E[/*(S)]/10; 



= Pr 
< 5/4. 



|/(X)-E /(X) 



> E 



/(X) /10 



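(3) Set $b = 1200 \log(1/\epsilon)$ and $t = 4\sqrt{\log(1/\epsilon)}$. Then $b - t\sqrt{b} \ge 1000 \log(1/\epsilon) \ge 2\,\mathrm{E}[f^*(S)]$, and
so $\Pr[f^*(S) \le b - t\sqrt{b}] \ge 1/2$ by Markov's inequality. By the theorem proved in Appendix B
we have $\Pr[f^*(S) \ge b] \le 2\exp(-t^2/4) \le \epsilon$, proving the claim.

(4) Set $b = 450 \log(1/\epsilon)\,\ell$ and $t = 4\sqrt{\log(1/\delta)}$. Then

$\begin{aligned}
b - t\sqrt{b} &= 450 \log(1/\epsilon)\,\ell - 4\sqrt{\log(1/\delta)}\,\sqrt{450 \log(1/\epsilon)\,\ell} \\
&\ge 425 \log(1/\epsilon)\,\ell
\end{aligned}$

since $4\sqrt{450 \log(1/\delta)} \le 25\sqrt{\ell}$. Therefore Markov's inequality implies that

$\Pr\Big[ \sum_{i=1}^{\ell} f^*(S_i) \le b - t\sqrt{b} \Big] \ge 1/20.$

By the theorem proved in Appendix B we have $\Pr\big[ \sum_{i=1}^{\ell} f^*(S_i) \ge b \big] \le 20\exp(-t^2/4) \le 20\delta^4 \le \delta/4$. ■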
C.3 Learning Lower Bounds 

Theorem 9. Let ACQ be an arbitrary learning algorithm that uses only a polynomial number of training
examples, which can be either drawn i.i.d. from the underlying distribution or value queries. There exists
a distribution $D$ and a submodular target function $f^*$ such that, with probability at least 1/4 (over the
draw of the training samples), the hypothesis function output by ACQ does not approximate $f^*$ within a
$o(n^{1/3}/\log n)$ factor on at least a 1/4 fraction of the examples under $D$. This holds even for the subclass of
matroid rank functions.

Proof. First, consider a fully-deterministic learning algorithm ACQ, i.e., an algorithm that doesn't even
sample from $D$, though it knows $D$ and can use it in deterministically choosing queries. Say this algorithm
makes $q \le n^c$ queries (which could be chosen adaptively). Each query has at most $n$ possible answers, since
the minimum rank of any set is zero and the maximum rank is at most $n$. So the total number of possible
sequences of answers is at most $n^q$.



EUif*(Si)<b-tVb] > 1/20 
Now, since the algorithm is deterministic, the hypothesis it outputs at the end is uniquely determined by
this sequence of answers. To be specific, its choice of the second query is uniquely determined by the answer
given to the first query, its choice of the third query is uniquely determined by the answers given to the first
two queries, and by induction, its choice of the $i$-th query $q_i$ is uniquely determined by the answers given to
all queries $q_1, \ldots, q_{i-1}$ so far. Its final hypothesis is uniquely determined by all $q$ answers. This then implies
that ACQ can output at most $n^q$ different hypotheses.

We will apply Theorem 1 with $k = 2^t$ where $t = c\log(n) + \log(\ln n) + 14$ (so $k = n^c \cdot \ln(n) \cdot 2^{14} >
10000 \cdot q \cdot \ln(n)$). Let $\mathcal{A}$ and $\mathcal{M}$ be the families constructed by Theorem 1. Let the underlying distribution
$D$ on $2^{[n]}$ be the uniform distribution on $\mathcal{A}$. (Note that $D$ is not a product distribution.) Choose a matroid
$M_B \in \mathcal{M}$ uniformly at random and let the target function be $f^* = \mathrm{rank}_{M_B}$. Let us fix a hypothesis $h$ that
ACQ might output. By Hoeffding bounds, we have:

< e -2(.01) 2 fc = e -2 9 -ln(n) = n ~2q 

i.e., with probability at least $1 - n^{-2q}$, $h$ has high approximation error on over 49% of the examples.

By a union bound over all the $n^q$ hypotheses $h$ that ACQ might output, we obtain that with
probability at least 1/4 (over the draw of the training samples) the hypothesis function output by ACQ does
not approximate $f^*$ within a $o(n^{1/3}/\log n)$ factor on at least a 1/4 fraction of the examples under $D$.

The above argument is a fixed randomized strategy for the adversary that works against any deterministic
ACQ making at most $n^c$ queries. By Yao's minimax principle, this means that, for any randomized algorithm
making at most $n^c$ queries, there exists $M_B$ which the algorithm does not learn well, even with arbitrary
value queries. ■

Corollary 2. Suppose one-way functions exist. For any constant $\epsilon > 0$, no algorithm can PMAC-learn the
class of non-negative, monotone, submodular functions with approximation factor $O(n^{1/3 - \epsilon})$, even if the
functions are given by polynomial-time algorithms computing their value on the support of the distribution.



Pr 

f*,s 



Pr 



f*(A) 



n l/3 

h(A),—h(A) 



< 0.49 



Proof (of Corollary 2). The argument follows Kearns-Valiant [53]. We will apply Theorem 1 with $k = 2^t$
where $t = n^{\epsilon}$. There exists a family of pseudorandom Boolean functions $H_t = \{ h_y : y \in \{0,1\}^t \}$, where
each function is of the form $h_y : \{0,1\}^t \to \{0,1\}$. Choose an arbitrary bijection between $\{0,1\}^t$ and $\mathcal{A}$.
Then each $h_y \in H_t$ corresponds to some subfamily $\mathcal{B} \subseteq \mathcal{A}$, and hence to a matroid rank function
$\mathrm{rank}_{M_{\mathcal{B}}}$. Suppose there is a PMAC-learning algorithm for this family of functions which achieves
approximation ratio better than $n^{1/3}/16t$ on a set of measure $1/2 + 1/\mathrm{poly}(n)$. Then this algorithm must be
predicting the function $h_y$ on a set of size $1/2 + 1/\mathrm{poly}(n) = 1/2 + 1/\mathrm{poly}(t)$. This is impossible, since
the family $H_t$ is pseudorandom. ■

D Expander Construction 

Theorem 5. Let $G = (U \cup V, E)$ be a random multigraph where $|U| = k$, $|V| = n$, and every $u \in U$
has exactly $d$ incident edges, each of which has an endpoint chosen uniformly and independently from all
nodes in $V$. Suppose that $k \ge 4$, $d \ge \log(k)/\epsilon$ and $n \ge 16Ld/\epsilon$. Then, with probability at least $1 - 2/k$, $G$
satisfies

$$|\Gamma(\{u\})| = d \quad \forall u \in U \tag{D.1}$$
$$|\Gamma(J)| \ge (1-\epsilon) \cdot d \cdot |J| \quad \forall J \subseteq U,\ |J| \le L. \tag{D.2}$$
The proof is a variant of the argument in Vadhan's survey [81, Theorem 4.4].
Proof. Fix $j \le L$ and consider any set $J \subseteq U$ of size $|J| = j$. The sampling process decides the neighbors
$\Gamma(J)$ by picking a sequence of $jd$ neighbors $v_1, \ldots, v_{jd} \in V$. An element $v_i$ of that sequence is called a
repeat if $v_i \in \{v_1, \ldots, v_{i-1}\}$. Conditioned on $v_1, \ldots, v_{i-1}$, the probability that $v_i$ is a repeat is at most $jd/n$.
The set $J$ violates (D.2) only if there exist more than $\epsilon jd$ repeats. The probability of this is at most

$$\binom{jd}{\epsilon jd} \left(\frac{jd}{n}\right)^{\epsilon jd} \le \left(\frac{e}{\epsilon}\right)^{\epsilon jd} \left(\frac{jd}{n}\right)^{\epsilon jd} \le (1/4)^{\epsilon jd}.$$

The last inequality follows from $j \le L$ and our hypothesis $n \ge 16Ld/\epsilon$. So the probability that there exists
a $J \subseteq U$ with $j = |J|$ that violates (D.2) is at most

$$\binom{k}{j} (1/4)^{\epsilon jd} \le k^j\, 2^{-2\epsilon jd} = 2^{-j(2\epsilon d - \log k)} \le k^{-j},$$

since $d \ge \log(k)/\epsilon$. Therefore the probability that any $J$ with $|J| \le L$ violates (D.2) is at most

$$\sum_{j \ge 1} k^{-j} \le 2/k.$$



We have not yet guaranteed that there are no parallel edges, i.e., that (D.1) is satisfied. For any $u \in U$
with $|\Gamma(\{u\})| < d$, we can arbitrarily replace any parallel edges by new edges with distinct endpoints. This
cannot decrease $|\Gamma(J)|$ for any $J$, and so (D.2) remains satisfied. ■
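
The construction can be simulated on a small scale. The following is a minimal sketch, not the paper's procedure verbatim: the function names, the parameter values and the use of base-2 logarithms are assumptions made for illustration. It samples the random multigraph, checks (D.2) by brute force over all $J$ with $|J| \le L$, and then redirects parallel edges to obtain (D.1), confirming that this step does not destroy (D.2).

```python
import random
from itertools import combinations

def sample_graph(k, n, d, rng):
    """Each left node u in range(k) draws d endpoints uniformly and independently from range(n)."""
    return [[rng.randrange(n) for _ in range(d)] for _ in range(k)]

def remove_parallel_edges(graph, n):
    """Redirect repeated endpoints of a node to unused right nodes so that (D.1) holds (needs d <= n)."""
    fixed = []
    for endpoints in graph:
        used = set(endpoints)
        spare = (v for v in range(n) if v not in used)
        seen, new_endpoints = set(), []
        for v in endpoints:
            if v in seen:
                v = next(spare)  # a fresh node, so no existing neighbor is lost
            seen.add(v)
            new_endpoints.append(v)
        fixed.append(new_endpoints)
    return fixed

def satisfies_d2(graph, d, L, eps):
    """Brute-force check of (D.2): |Gamma(J)| >= (1 - eps) * d * |J| for all 1 <= |J| <= L."""
    for size in range(1, L + 1):
        for J in combinations(range(len(graph)), size):
            gamma = set().union(*(graph[u] for u in J))
            if len(gamma) < (1 - eps) * d * size:
                return False
    return True

# Illustrative parameters meeting the hypotheses (with base-2 logarithms):
# k >= 4, d >= log2(k)/eps = 6, n >= 16*L*d/eps = 576.
k, eps, L, d, n = 8, 0.5, 3, 6, 576
rng = random.Random(0)
trials, successes = 200, 0
for _ in range(trials):
    raw = sample_graph(k, n, d, rng)
    ok = satisfies_d2(raw, d, L, eps)
    fixed = remove_parallel_edges(raw, n)
    assert all(len(set(endpoints)) == d for endpoints in fixed)  # (D.1)
    assert not ok or satisfies_d2(fixed, d, L, eps)              # (D.2) is preserved
    successes += ok
success_rate = successes / trials
```

The brute-force check is exponential in $L$, so this is only practical for tiny instances; the theorem guarantees a success probability of at least $1 - 2/k$.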



E Special Cases of the Matroid Construction 

The matroid constructions of Theorem 2 and Theorem 3 have several interesting special cases.

E.1 Partition Matroids

We are given disjoint sets $A_1, \ldots, A_k$ and values $b_1, \ldots, b_k$. We claim that the matroid $\mathcal{I}$ defined in
Theorem 2 is a partition matroid. To see this, note that $g(J) = \sum_{j \in J} b_j$ since the $A_j$'s are disjoint, so $g$ is a
modular function. Similarly, $|I \cap A(J)|$ is a modular function of $J$. Thus, whenever $|J| > 1$, the constraint
$|I \cap A(J)| \le g(J)$ is redundant: it is implied by the constraints $|I \cap A_j| \le b_j$ for $j \in J$. So we have

$$\mathcal{I} = \{ I : |I \cap A(J)| \le g(J) \ \ \forall J \subseteq [k] \} = \{ I : |I \cap A_j| \le b_j \ \ \forall j \in [k] \},$$

which is the desired partition matroid.
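
The redundancy of the constraints with $|J| > 1$ can be checked by brute force on a small instance. This is a sketch under assumptions: the sets `A`, the values `b` and the helper names are invented for illustration, and $A(J)$ is taken to be the union of the $A_j$ with $j \in J$.

```python
from itertools import chain, combinations

def subsets(items):
    """All subsets of items, as tuples."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

# Illustrative instance: disjoint sets A_j with values b_j on the ground set {0, ..., 7}.
A = [{0, 1, 2}, {3, 4}, {5, 6, 7}]
b = [2, 1, 1]
ground = range(8)
k = len(A)

def satisfies_all_J(I):
    """Check |I & A(J)| <= g(J) for every nonempty J, with g(J) the sum of b_j over J."""
    for J in subsets(range(k)):
        if not J:
            continue
        A_J = set().union(*(A[j] for j in J))
        if len(I & A_J) > sum(b[j] for j in J):
            return False
    return True

def satisfies_singletons(I):
    """Check only the partition-matroid constraints |I & A_j| <= b_j."""
    return all(len(I & A[j]) <= b[j] for j in range(k))

family_all = {frozenset(I) for I in subsets(ground) if satisfies_all_J(set(I))}
family_single = {frozenset(I) for I in subsets(ground) if satisfies_singletons(set(I))}
assert family_all == family_single
```

Both families coincide here because, for disjoint sets, each side of the constraint for $J$ is the sum of the corresponding sides of the singleton constraints.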

E.2 Pairwise Intersections 

We are given sets $A_1, \ldots, A_k$ and values $b_1, \ldots, b_k$. We now describe the special case of the matroid
construction which only considers the pairwise intersections of the $A_i$'s.

Lemma 8. Let $d$ be a non-negative integer such that $d \le \min_{i,j \in [k]} (b_i + b_j - |A_i \cap A_j|)$. Then

$$\mathcal{I} = \{ I : |I| \le d \ \wedge\ |I \cap A_j| \le b_j \ \ \forall j \in [k] \}$$

is the family of independent sets of a matroid.

Proof. Note that for any pair $J = \{i,j\}$, we have $g(J) = b_i + b_j - |A_i \cap A_j|$. Then

$$d \le \min_{i,j \in [k]} (b_i + b_j - |A_i \cap A_j|) = \min_{J \subseteq [k],\, |J| = 2} g(J),$$

so $g$ is $(d, 2)$-large. The lemma follows from Theorem 3. ■
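
Lemma 8 can be sanity-checked by brute force. The sketch below uses an invented instance (the sets `A`, values `b` and helper names are assumptions, and the minimum is taken over pairs $i \ne j$). It builds the family from the lemma and tests the independence axioms, once with $d$ at the bound and once with $d$ one larger, where the exchange property fails for this instance.

```python
from itertools import combinations

def independent_sets(ground, A, b, d):
    """All I with |I| <= d and |I & A_j| <= b_j for every j."""
    family = set()
    for r in range(d + 1):
        for I in combinations(ground, r):
            I = frozenset(I)
            if all(len(I & A_j) <= b_j for A_j, b_j in zip(A, b)):
                family.add(I)
    return family

def is_matroid(family):
    """Check the independence axioms by brute force."""
    if frozenset() not in family:
        return False
    for I in family:
        # Downward closure: single-element deletions suffice, by induction.
        if any(I - {x} not in family for x in I):
            return False
    for I in family:
        for K in family:
            # Exchange: a smaller independent set can be extended from a larger one.
            if len(I) < len(K) and not any(I | {x} in family for x in K - I):
                return False
    return True

# Illustrative instance on the ground set {0, ..., 6}.
ground = range(7)
A = [{0, 1, 2, 3}, {2, 3, 4, 5}, {0, 5, 6}]
b = [2, 3, 2]
bound = min(b[i] + b[j] - len(A[i] & A[j]) for i, j in combinations(range(len(A)), 2))

holds_at_bound = is_matroid(independent_sets(ground, A, b, bound))
holds_above_bound = is_matroid(independent_sets(ground, A, b, bound + 1))
```

For this instance the bound is 3. With $d = 4$ the sets $\{0,1,5\}$ and $\{2,3,5,6\}$ are both in the family, but no element of the larger one can be added to the smaller one, so the hypothesis on $d$ cannot simply be dropped.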
E.3 Paving Matroids

A paving matroid is defined to be a matroid $M = (V, \mathcal{I})$ of rank $m$ such that every circuit has cardinality
either $m$ or $m + 1$. We will show that every paving matroid can be derived from our matroid construction
(Theorem 3). First of all, we require a structural lemma about paving matroids.

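Lemma 9. Let $M = (V, \mathcal{I})$ be a paving matroid of rank $m$. There exists a family
$\mathcal{A} = \{A_1, \ldots, A_k\} \subseteq 2^V$ such that

$$\mathcal{I} = \{ I : |I| \le m \ \wedge\ |I \cap A_i| \le m - 1 \ \ \forall i \} \tag{E.1a}$$

$$|A_i \cap A_j| \le m - 2 \quad \forall i \ne j \tag{E.1b}$$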

Related results can be found in Theorem 5.3.5, Problem 5.3.7 and Exercise 5.3.8 of Frank's book [28]. 
Proof. It is easy to see that there exists $\mathcal{A}$ satisfying Eq. (E.1a), since we may simply take $\mathcal{A}$ to be the
family of circuits which have size $m$. So let us choose a family $\mathcal{A}$ that satisfies Eq. (E.1a) and minimizes
$|\mathcal{A}|$. We will show that this family must satisfy Eq. (E.1b). Suppose otherwise, i.e., there exist $i \ne j$ such
that $|A_i \cap A_j| \ge m - 1$.

Case 1: $r(A_i \cup A_j) \le m - 1$. Then $\mathcal{A} \setminus \{A_i, A_j\} \cup \{A_i \cup A_j\}$ also satisfies Eq. (E.1a), contradicting
minimality of $|\mathcal{A}|$.

Case 2: $r(A_i \cup A_j) = m$. Observe that $r(A_i \cap A_j) \ge m - 1$ since $|A_i \cap A_j| \ge m - 1$ and every set of
size $m - 1$ is independent. So we have

$$r(A_i \cup A_j) + r(A_i \cap A_j) \ge m + (m - 1) > (m - 1) + (m - 1) \ge r(A_i) + r(A_j).$$

This contradicts submodularity of the rank function. ■

For any paving matroid, Lemma 9 implies that its independent sets can be written in the form

$$\mathcal{I} = \{ I : |I| \le m \ \wedge\ |I \cap A_i| \le m - 1 \ \ \forall i \},$$

where $|A_i \cap A_j| \le m - 2$ for each $i \ne j$. This is a special case of Theorem 3 since we may apply Lemma 8
with each $b_i = m - 1$ and $d = m$, since

$$\min_{i,j \in [k]} (b_i + b_j - |A_i \cap A_j|) \ge 2(m - 1) - (m - 2) = m.$$
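
As an illustration, here is a small instance of the form (E.1a) and (E.1b), with an invented ground set and family `A` (assumptions for this sketch, not an example from the paper). It evaluates the Lemma 8 bound with $b_i = m - 1$ and enumerates the circuits by brute force to confirm that they all have cardinality $m$ or $m + 1$.

```python
from itertools import combinations

m = 3
ground = range(6)
A = [{0, 1, 2}, {2, 3, 4}]  # illustrative family; |A_1 & A_2| = 1 <= m - 2, as in (E.1b)

def is_independent(I):
    """Membership test from (E.1a)."""
    return len(I) <= m and all(len(I & A_i) <= m - 1 for A_i in A)

def circuits():
    """Minimal dependent sets: dependent, but every one-element deletion is independent."""
    found = []
    for r in range(1, len(ground) + 1):
        for C in combinations(ground, r):
            C = frozenset(C)
            if not is_independent(C) and all(is_independent(C - {x}) for x in C):
                found.append(C)
    return found

all_circuits = circuits()
circuit_sizes = {len(C) for C in all_circuits}

# Lemma 8 hypothesis with b_i = m - 1 for all i and d = m.
lemma8_bound = min(2 * (m - 1) - len(A_i & A_j) for A_i, A_j in combinations(A, 2))
```

Checking one-element deletions is enough to detect minimality here because the family defined by (E.1a) is closed under taking subsets.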